This report is a long version of our paper entitled A PAC-Bayesian Approach for Domain Adaptation
with Specialization to Linear Classifiers published in the proceedings of the International Conference on
Machine Learning (ICML) 2013. We improved our main results, extended our experiments, and proposed
an extension to multisource domain adaptation.

Abstract 

In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation 
where the objective is to learn, from a source distribution, a well-performing majority vote on a 
different target distribution. On the one hand, we propose an improvement of the previous approach 
proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a 
disagreement averaging, allowing us to derive a new tighter PAC-Bayesian domain adaptation bound 
for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm 
which shows interesting results on a synthetic problem and on a popular sentiment annotation task. 

On the other hand, we generalize these results to multisource domain adaptation allowing us to take 
into account different source domains. This study opens the door to tackle domain adaptation tasks 
by making use of all the PAC-Bayesian tools. 


1 Introduction 

As human beings, we learn from what we have seen before. Think about our education process: when a
student attends a new course, they have to make use of the knowledge acquired during previous courses.
However, in machine learning the most common assumption is that the learning and test data are drawn
from the same probability distribution. This strong assumption may be clearly irrelevant for many real
tasks, including those where we want to adapt a model from one task to another. For instance, a spam
filtering system suitable for one user can be poorly adapted to another who receives significantly
different emails. In other words, the learning data associated with one or several users could be
unrepresentative of the test data coming from another one. This reinforces the need to design methods
for adapting a classifier from learning (source) data to test (target) data. One solution to tackle this
issue is to consider the domain adaptation framework¹, which arises when the distribution generating
the target data (the target domain) differs from the one generating the source data (the source
domain). In such a situation, it is well known that domain adaptation is a hard and challenging task even
under strong assumptions (Ben-David and Urner, 2012; Ben-David et al., 2010b; Ben-David and Urner,
2014). Note that domain adaptation with learning data coming from different source domains is referred
to as multisource or multiple sources domain adaptation (Crammer et al., 2007; Mansour et al., 2009c;
Ben-David et al., 2010a).

¹ See the surveys proposed by Jiang (2008); Quiñonero-Candela et al. (2009); and Margolis (2011).
Among the existing approaches in the literature to address domain adaptation, the instance
weighting-based methods allow one to deal with the covariate-shift problem (e.g., Huang et al., 2006;
Sugiyama et al., 2008), where source and target domains diverge only in their marginals, i.e., they share
the same labeling function. Another technique is to exploit self-labeling procedures, where the objective
is to transfer the source labels to the target unlabeled points (e.g., Bruzzone and Marconcini, 2010;
Habrard et al., 2013; Morvant, 2014). A third solution is to learn a new common representation from the
unlabeled part of source and target data. Then, a standard supervised learning algorithm can be executed
on the source labeled instances (e.g., Glorot et al., 2011; Chen et al., 2012).

The work presented in this paper belongs to a popular class of approaches, which relies on a distance
between the source distribution and the target distribution. Such a distance depends on the set
$\mathcal{H}$ of hypotheses (or classifiers) considered by the learning algorithm. The intuition behind
this approach is that one must look for a set $\mathcal{H}$ that minimizes the distance while preserving
good performance on the source data; if the distributions are close under this measure, then
generalization ability may be “easier” to quantify. In fact, defining such a measure to quantify how much
the domains are related is a major issue in domain adaptation. For example, in the context of binary
classification with the 0-1 loss function, Ben-David et al. (2010a) and Ben-David et al. (2006) have
considered the $\mathcal{H}\Delta\mathcal{H}$-divergence between the marginal distributions. This
quantity is based on the maximal disagreement between two classifiers, allowing them to deduce a domain
adaptation generalization bound based on the VC-dimension theory. The discrepancy distance proposed by
Mansour et al. (2009a) generalizes this divergence to real-valued functions and more general losses, and
is used to obtain a generalization bound based on the Rademacher complexity. In this context, Cortes and
Mohri (2011, 2014) have specialized the minimization of the discrepancy to regression with kernels. In
these situations, domain adaptation can be viewed as a multiple trade-off between the complexity of the
hypothesis class $\mathcal{H}$, the adaptation ability of $\mathcal{H}$ according to the divergence
between the marginals, and the empirical source risk. Moreover, other measures have been exploited under
different assumptions, such as the Rényi divergence suitable for importance weighting (Mansour et al.,
2009b), or the measure proposed by C. Zhang (2012) which takes into account the source and target true
labeling, or the Bayesian “divergence prior” (Li and Bilmes, 2007) which favors classifiers closer to the
best source model. However, a majority of methods prefer a two-step approach: (i) first construct a
suitable representation by minimizing the divergence, then (ii) learn a model on the source domain in the
new representation space.

The novelty of our contribution is to explore the PAC-Bayesian framework to tackle domain adaptation
in a binary classification situation without target labels (sometimes called unsupervised domain
adaptation). Given a prior distribution over a family of classifiers $\mathcal{H}$, PAC-Bayesian theory
(introduced by McAllester, 1999) focuses on algorithms that output a posterior distribution $\rho$ over
$\mathcal{H}$ (i.e., a $\rho$-average over $\mathcal{H}$) rather than just a single classifier
$h \in \mathcal{H}$. Following this principle, we propose a pseudometric which evaluates the domain
divergence according to the $\rho$-average disagreement of the classifiers over the domains. This
disagreement measure shows many advantages. First, it is ideal for the PAC-Bayesian setting, since it is
expressed as a $\rho$-average over $\mathcal{H}$. Second, we prove that it is never larger than the
popular $\mathcal{H}\Delta\mathcal{H}$-divergence. Last but not least, our measure can be easily
estimated from samples. Indeed, based on this disagreement measure, we derived in a previous work
(Germain et al., 2013) a first PAC-Bayesian domain adaptation bound expressed as a $\rho$-averaging. In
this paper, we provide a new version of this result, which does not change the philosophy supported by
the previous bound, but clearly improves the theoretical result: the domain adaptation bound is now
tighter and easier to interpret. Thanks to this new result, we also derive² three new PAC-Bayesian domain
adaptation generalization bounds. Then, in contrast to the majority of methods that perform a two-step
procedure, we design an algorithm tailored to linear classifiers, called PBDA, which jointly minimizes
the multiple trade-offs implied by the bounds. The first two quantities are, as usual in the PAC-Bayesian
approach, the complexity of the majority vote measured by a Kullback-Leibler divergence and the empirical
risk measured by the $\rho$-average errors on the source sample. The third quantity corresponds to our
domain divergence and assesses the capacity of the posterior distribution to distinguish some structural
difference between the source and target samples. Finally, we extend our results to domain adaptation
with multiple sources by considering a mixture of different source domains as done by Ben-David et al.
(2010a).

The rest of the paper is structured as follows. Section 2 deals with two seminal works on domain
adaptation. The PAC-Bayesian framework is then recalled in Section 3. Note that for the sake of
completeness, we provide for the first time the explicit derivation of the algorithm PBGD3 (Germain et
al., 2009a) tailored to linear classifiers in supervised learning. Our main contribution, which consists
in a domain adaptation bound suitable for PAC-Bayesian learning, is presented in Section 4. Then, we
derive our new algorithm for PAC-Bayesian domain adaptation in Section 5, which we evaluate
experimentally in Section 6. Afterwards, we generalize this analysis to multisource domain adaptation in
Section 7. Before concluding in Section 9, we discuss two important points in Section 8: (i) two
different results for the multisource setting that raise open questions for deriving new algorithms, and
(ii) the comparison between our new result and the one provided in Germain et al. (2013).

² In this paper, we were very keen to improve the readability of our proofs, particularly those provided
by Germain et al. (2013) as supplementary material. The proof techniques may be of independent interest.


2 Domain Adaptation Related Works 

In this section, we review the two seminal works in domain adaptation that are based on a divergence
measure between the domains (Ben-David et al., 2010a; Ben-David et al., 2006; Mansour et al., 2009a).


2.1 Notations and Setting 

We consider domain adaptation for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the
input space of dimension $d$ and $Y = \{-1, +1\}$ is the label set. The source domain $P_S$ and the
target domain $P_T$ are two different distributions over $X \times Y$ (unknown and fixed), $D_S$ and
$D_T$ being the respective marginal distributions over $X$. We tackle the challenging task where we have
no target labels. A learning algorithm is then provided with a labeled source sample
$S = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{m}$ consisting of $m$ examples drawn i.i.d.³ from $P_S$, and an
unlabeled target sample $T = \{\mathbf{x}_i^t\}_{i=1}^{m'}$ consisting of $m'$ examples drawn i.i.d. from
$D_T$. Note that we denote the distribution of an $m$-sample by $(P_S)^m$. We suppose that $\mathcal{H}$
is a set of hypothesis functions from $X$ to $Y$. The expected source error and the expected target error
of $h \in \mathcal{H}$ over $P_S$, respectively $P_T$, are the probability that $h$ errs on the entire
distribution $P_S$, respectively $P_T$,

$$R_{P_S}(h) = \mathop{\mathbb{E}}_{(\mathbf{x}^s, y^s) \sim P_S} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^s), y^s\big), \quad \text{and} \quad R_{P_T}(h) = \mathop{\mathbb{E}}_{(\mathbf{x}^t, y^t) \sim P_T} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^t), y^t\big),$$

where $\mathcal{L}_{0\text{-}1}(a, b) = I[a \neq b]$ is the 0-1 loss function which returns 1 if
$a \neq b$ and 0 otherwise. The empirical source error $R_S(\cdot)$ on the learning sample $S$ is

$$R_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}_i^s), y_i^s\big).$$

The main objective in domain adaptation is then to learn, without target labels, a classifier
$h \in \mathcal{H}$ leading to the lowest expected target error $R_{P_T}(h)$.


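We also introduce the expected source disagreement $R_{D_S}(h, h')$ and the expected target disagreement
$R_{D_T}(h, h')$ of $(h, h') \in \mathcal{H}^2$, which measure the probability that two classifiers $h$
and $h'$ do not agree on the respective marginal distributions, and are defined by

$$R_{D_S}(h, h') = \mathop{\mathbb{E}}_{\mathbf{x}^s \sim D_S} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^s), h'(\mathbf{x}^s)\big) \quad \text{and} \quad R_{D_T}(h, h') = \mathop{\mathbb{E}}_{\mathbf{x}^t \sim D_T} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^t), h'(\mathbf{x}^t)\big).$$

The empirical source disagreement $R_S(h, h')$ on $S$ and the empirical target disagreement
$R_T(h, h')$ on $T$ are

$$R_S(h, h') = \frac{1}{m} \sum_{\mathbf{x}^s \in S} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^s), h'(\mathbf{x}^s)\big) \quad \text{and} \quad R_T(h, h') = \frac{1}{m'} \sum_{\mathbf{x}^t \in T} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}^t), h'(\mathbf{x}^t)\big).$$

Note that, depending on the context, $S$ denotes either the source labeled sample
$\{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{m}$ or its unlabeled part $\{\mathbf{x}_i^s\}_{i=1}^{m}$.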

Note also that the expected error $R_P(h)$ on a distribution $P$ can be viewed as a shortcut notation for
the expected disagreement between a hypothesis $h$ and a labeling function $f_P$ that assigns the true
label to an example description with respect to $P$. We have

$$R_P(h) = R_D(h, f_P) = \mathop{\mathbb{E}}_{\mathbf{x} \sim D} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}), f_P(\mathbf{x})\big),$$

where $D$ is the marginal distribution of $P$ over $X$.
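The empirical quantities of this section are simple averages of 0-1 losses. The following minimal sketch illustrates them on toy data; the two hypotheses `h_first` and `h_second`, the samples, and all function names are illustrative assumptions and do not come from the experiments of this paper.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
Hypothesis = Callable[[Point], int]  # returns a label in {-1, +1}


def zero_one_loss(a: int, b: int) -> int:
    """0-1 loss: 1 if the two labels differ, 0 otherwise."""
    return int(a != b)


def empirical_error(h: Hypothesis, labeled: Sequence[Tuple[Point, int]]) -> float:
    """Average 0-1 loss of h on a labeled sample."""
    return sum(zero_one_loss(h(x), y) for x, y in labeled) / len(labeled)


def empirical_disagreement(h: Hypothesis, h_prime: Hypothesis,
                           points: Sequence[Point]) -> float:
    """Fraction of unlabeled points on which h and h_prime disagree."""
    return sum(zero_one_loss(h(x), h_prime(x)) for x in points) / len(points)


# Two hypothetical classifiers: each one thresholds a single coordinate.
def h_first(x: Point) -> int:
    return 1 if x[0] >= 0 else -1


def h_second(x: Point) -> int:
    return 1 if x[1] >= 0 else -1


# Toy labeled source sample and unlabeled target sample (made-up values).
source_sample = [((1.0, 2.0), 1), ((-1.0, 0.5), -1),
                 ((0.5, -1.0), -1), ((-2.0, -1.0), -1)]
target_points = [(1.0, -1.0), (2.0, 3.0), (-1.0, 1.0)]
source_points = [x for x, _ in source_sample]

# A labeling function f_P defined only on the toy source points.
label_of = dict(source_sample)


def labeling_function(x: Point) -> int:
    return label_of[x]


source_error = empirical_error(h_first, source_sample)
source_disagreement = empirical_disagreement(h_first, h_second, source_points)
target_disagreement = empirical_disagreement(h_first, h_second, target_points)
# The error of h equals its disagreement with the labeling function.
error_as_disagreement = empirical_disagreement(h_first, labeling_function,
                                               source_points)
```

On this toy data the error of `h_first` coincides with its disagreement with `labeling_function`, which is the empirical counterpart of the identity $R_P(h) = R_D(h, f_P)$.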

³ i.i.d. stands for independent and identically distributed.
2.2 Necessity of a Domain Divergence

The domain adaptation objective is to find a low-error target hypothesis, even if the target labels are
not available. Even under strong assumptions, this task can be impossible to solve (Ben-David and Urner,
2012; Ben-David et al., 2010b). However, for deriving generalization ability in a domain adaptation
situation (with the help of a domain adaptation bound), it is critical to make use of a divergence between
the source and the target domains: the more similar the domains, the easier the adaptation appears.
Some previous works have proposed different quantities to estimate how close a domain is to another
one (C. Zhang, 2012; Ben-David et al., 2010a; Mansour et al., 2009a,b; Ben-David et al., 2006; Li and
Bilmes, 2007). Concretely, two domains $P_S$ and $P_T$ differ if their marginals $D_S$ and $D_T$ are
different, or if the source labeling function differs from the target one, or if both happen. This
suggests taking into account two divergences: one between $D_S$ and $D_T$ and one between the labelings.
If we have some target labels, we can combine the two distances as done by C. Zhang (2012). Otherwise, we
preferably consider two separate measures, since it is impossible to estimate the best target hypothesis
in such a situation. Usually, we suppose that the source labeling function is somehow related to the
target one, then we look for a representation where the marginals $D_S$ and $D_T$ appear closer without
losing performance on the source domain.

2.3 Domain Adaptation Bounds for Binary Classification 

We now review the first two seminal works which propose domain adaptation bounds based on a marginal 
divergence. 

First, under the assumption that there exists a hypothesis in $\mathcal{H}$ that performs well on both
the source and the target domain, Ben-David et al. (2010a) and Ben-David et al. (2006) have provided the
following domain adaptation bound.

Theorem 1 (Ben-David et al. (2010a); Ben-David et al. (2006)). Let $\mathcal{H}$ be a (symmetric⁴)
hypothesis class. We have

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) \le R_{P_S}(h) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \mu_{h^*}, \tag{1}$$

where

$$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) = \sup_{(h, h') \in \mathcal{H}^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big|$$

is the $\mathcal{H}\Delta\mathcal{H}$-distance between the marginals $D_S$ and $D_T$, and

$$\mu_{h^*} = R_{P_S}(h^*) + R_{P_T}(h^*)$$

is the error of the best hypothesis overall, denoted $h^*$, and defined by

$$h^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} \big( R_{P_S}(h) + R_{P_T}(h) \big).$$

This bound depends on three terms. $R_{P_S}(h)$ is the classical source domain expected error.
$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)$ depends on $\mathcal{H}$ and corresponds to the
maximum disagreement between two hypotheses of $\mathcal{H}$. In other words, it quantifies how
hypotheses from $\mathcal{H}$ can “detect” differences between these marginals: the lower this measure is
for a given $\mathcal{H}$, the better are the generalization guarantees. The last term
$\mu_{h^*} = R_{P_S}(h^*) + R_{P_T}(h^*)$ is related to the best hypothesis $h^*$ over the domains and
acts as a quality measure of $\mathcal{H}$ in terms of labeling information. If $h^*$ does not have a
good performance on both the source and the target domain, then there is no way one can adapt from this
source to this target. Hence, as pointed out by the authors, Equation (1), together with the usual
VC-bound theory, expresses a multiple trade-off between the accuracy of some particular hypothesis $h$,
the complexity of $\mathcal{H}$, and the “incapacity” of hypotheses of $\mathcal{H}$ to detect
differences between the source and the target domain.

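Second, Mansour et al. (2009a) have extended the $\mathcal{H}\Delta\mathcal{H}$-distance to the
discrepancy divergence for regression and any symmetric loss $\mathcal{L}$ fulfilling the triangle
inequality. Given $\mathcal{L} : [-1, +1]^2 \to \mathbb{R}^+$ such a loss, the discrepancy
$\mathrm{disc}_{\mathcal{L}}(D_S, D_T)$ between $D_S$ and $D_T$ is

$$\mathrm{disc}_{\mathcal{L}}(D_S, D_T) = \sup_{(h, h') \in \mathcal{H}^2} \Big| \mathop{\mathbb{E}}_{\mathbf{x}^t \sim D_T} \mathcal{L}\big(h(\mathbf{x}^t), h'(\mathbf{x}^t)\big) - \mathop{\mathbb{E}}_{\mathbf{x}^s \sim D_S} \mathcal{L}\big(h(\mathbf{x}^s), h'(\mathbf{x}^s)\big) \Big|.$$

⁴ In a symmetric hypothesis space $\mathcal{H}$, for every $h \in \mathcal{H}$, its inverse $-h$ is also in $\mathcal{H}$.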
Note that with the 0-1 loss in binary classification, we have

$$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) = \mathrm{disc}_{\mathcal{L}_{0\text{-}1}}(D_S, D_T).$$

Even if these two divergences may coincide, the following domain adaptation bound of Mansour et al.
(2009a) differs from Theorem 1.

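Theorem 2 (Mansour et al. (2009a)). Let $\mathcal{H}$ be a (symmetric) hypothesis class. We have

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) - R_{P_T}(h_T^*) \le R_{D_S}(h_S^*, h) + \mathrm{disc}_{\mathcal{L}_{0\text{-}1}}(D_S, D_T) + \nu_{(h_S^*, h_T^*)}, \tag{2}$$

where

$$\nu_{(h_S^*, h_T^*)} = R_{D_S}(h_S^*, h_T^*)$$

is the disagreement between the ideal hypotheses on the target and source domains defined respectively as

$$h_T^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} R_{P_T}(h), \quad \text{and} \quad h_S^* = \mathop{\mathrm{argmin}}_{h \in \mathcal{H}} R_{P_S}(h).$$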

In this context, Equation (2) can be tighter⁵ since it bounds the difference between the target error
of a classifier and the one of the optimal $h_T^*$. This bound expresses a trade-off between the
disagreement (between $h$ and the best source hypothesis $h_S^*$), the complexity of $\mathcal{H}$ (with
the Rademacher complexity), and, again, the “incapacity” of hypotheses to detect differences between the
domains.

To conclude, the domain adaptation bounds (1) and (2) suggest that if the divergence between the
domains is low, a low-error classifier over the source domain might perform well on the target one. These
divergences compute the worst case of the disagreement between a pair of hypotheses. We propose in
Section 4 an average case approach by making use of the essence of the PAC-Bayesian theory, which is
known to offer tight generalization bounds (McAllester, 1999; Germain et al., 2009a; Parrado-Hernandez
et al., 2012).

3 PAC-Bayesian Theory in Supervised Learning 

Let us now review the classical supervised binary classification framework called the PAC-Bayesian
theory, first introduced by McAllester (1999). This theory succeeds in providing tight generalization
guarantees on majority vote classifiers, without relying on any validation set.

Throughout this section, we adopt an algorithm design perspective: we interpret the various forms
of the PAC-Bayesian theorem as a guide to derive new machine learning algorithms. Indeed, the
PAC-Bayesian analysis of domain adaptation provided in the forthcoming sections is oriented by the
motivation of creating new adaptive algorithms.

3.1 Notations and Setting 

Traditionally, the PAC-Bayesian theory considers weighted majority votes over a set $\mathcal{H}$ of
binary hypotheses. Given a prior distribution $\pi$ over $\mathcal{H}$ and a training set $S$, the
learner aims at finding the posterior distribution $\rho$ over $\mathcal{H}$ leading to a
$\rho$-weighted majority vote $B_\rho$ (also called the Bayes classifier) with good generalization
guarantees and defined by

$$B_\rho(\mathbf{x}) = \mathrm{sign}\Big[ \mathop{\mathbb{E}}_{h \sim \rho} h(\mathbf{x}) \Big].$$

Minimizing $R_{P_S}(B_\rho)$, the risk of $B_\rho$, is known to be NP-hard. In the PAC-Bayesian approach,
it is replaced by the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$. In order
to predict the label of an example $\mathbf{x}$, the Gibbs classifier first draws a hypothesis $h$ from
$\mathcal{H}$ according to $\rho$, then returns $h(\mathbf{x})$ as label. Note that the error of the
Gibbs classifier on a domain $P_S$ corresponds to the expectation of the errors over $\rho$:

$$R_{P_S}(G_\rho) = \mathop{\mathbb{E}}_{h \sim \rho} R_{P_S}(h). \tag{3}$$

⁵ Equation (1) can lead to an error term 3 times higher than Equation (2) in some cases (Mansour et al., 2009a).
In this setting, if $B_\rho$ misclassifies $\mathbf{x}$, then at least half of the classifiers (under
$\rho$) err on $\mathbf{x}$. Hence, we have

$$R_{P_S}(B_\rho) \le 2 R_{P_S}(G_\rho).$$

Another result on the relation between $R_{P_S}(B_\rho)$ and $R_{P_S}(G_\rho)$ is the C-bound of Lacasse
et al. (2006) expressed as

$$R_{P_S}(B_\rho) \le 1 - \frac{\big(1 - 2 R_{P_S}(G_\rho)\big)^2}{1 - 2 R_{D_S}(G_\rho, G_\rho)}, \tag{4}$$

where $R_{D_S}(G_\rho, G_\rho)$ corresponds to the disagreement of the classifiers over $\rho$:

$$R_{D_S}(G_\rho, G_\rho) = \mathop{\mathbb{E}}_{(h, h') \sim \rho^2} R_{D_S}(h, h'). \tag{5}$$

Equation (4) suggests that for a fixed numerator, i.e., a fixed risk of the Gibbs classifier, the best
majority vote is the one with the lowest denominator, i.e., with the greatest disagreement between its
voters (see Laviolette et al. (2011) for further analysis).
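To make the relation between the majority vote, the Gibbs classifier and the C-bound concrete, the sketch below evaluates them for a finite set of voters on a small sample (the inequalities hold for any distribution, in particular for the empirical one). The three voters, their weights and the labels are toy assumptions chosen for illustration only.

```python
from typing import List, Sequence


def gibbs_risk(votes: Sequence[Sequence[int]], rho: Sequence[float],
               labels: Sequence[int]) -> float:
    """rho-average of the individual voter error rates."""
    n = len(labels)
    return sum(
        weight * sum(v != y for v, y in zip(voter, labels)) / n
        for voter, weight in zip(votes, rho)
    )


def majority_vote_risk(votes: Sequence[Sequence[int]], rho: Sequence[float],
                       labels: Sequence[int]) -> float:
    """Error rate of the rho-weighted majority vote (a tie counts as an error)."""
    errors = 0
    for i, y in enumerate(labels):
        score = sum(weight * voter[i] for voter, weight in zip(votes, rho))
        errors += int(y * score <= 0)
    return errors / len(labels)


def expected_disagreement(votes: Sequence[Sequence[int]],
                          rho: Sequence[float]) -> float:
    """Disagreement of a pair of voters drawn independently from rho."""
    n = len(votes[0])
    total = 0.0
    for voter_j, weight_j in zip(votes, rho):
        for voter_k, weight_k in zip(votes, rho):
            differing = sum(a != b for a, b in zip(voter_j, voter_k))
            total += weight_j * weight_k * differing / n
    return total


def c_bound(risk_gibbs: float, disagreement: float) -> float:
    """C-bound of Equation (4); meaningful when the Gibbs risk is below 1/2."""
    return 1.0 - (1.0 - 2.0 * risk_gibbs) ** 2 / (1.0 - 2.0 * disagreement)


# Toy data: three voters, each wrong on a different example.
labels: List[int] = [1, -1, 1, 1, -1, 1]
votes: List[List[int]] = [
    [1, -1, 1, 1, 1, 1],    # errs on the fifth example
    [1, -1, 1, -1, -1, 1],  # errs on the fourth example
    [1, -1, -1, 1, -1, 1],  # errs on the third example
]
rho: List[float] = [1.0 / 3.0] * 3

risk_g = gibbs_risk(votes, rho, labels)
risk_b = majority_vote_risk(votes, rho, labels)
disagreement = expected_disagreement(votes, rho)
bound_c = c_bound(risk_g, disagreement)
```

In this example the voters err on different points, so their disagreement is high and the C-bound is noticeably smaller than twice the Gibbs risk, in line with the discussion of Equation (4).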

Finally, we introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn
according to the distribution $\rho$, defined as

$$e_{P_S}(G_\rho, G_\rho) = \mathop{\mathbb{E}}_{(h, h') \sim \rho^2} \; \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \mathcal{L}_{0\text{-}1}\big(h(\mathbf{x}), y\big) \times \mathcal{L}_{0\text{-}1}\big(h'(\mathbf{x}), y\big). \tag{6}$$

The PAC-Bayesian theory allows one to bound the expected error $R_{P_S}(G_\rho)$ in terms of two major
quantities: the empirical error $R_S(G_\rho) = \mathop{\mathbb{E}}_{h \sim \rho} R_S(h)$ estimated on a
sample $S$ drawn i.i.d. from $P_S$ and the Kullback-Leibler divergence
$\mathrm{KL}(\rho \| \pi) = \mathop{\mathbb{E}}_{h \sim \rho} \ln \frac{\rho(h)}{\pi(h)}$ (let us recall
that $\pi$ and $\rho$ are respectively the prior and the posterior distributions). The three main
PAC-Bayes theorems, that we present in the next section, have been proposed by McAllester (1999);
Seeger (2002); Langford (2005); and Catoni (2007).


3.2 Three Versions of the PAC-Bayesian Theorem 

First, let us consider the KL-divergence $\mathrm{kl}(a \| b)$ between two Bernoulli distributions with
success probability $a$ and $b$, defined by

$$\mathrm{kl}(a \| b) = a \ln \frac{a}{b} + (1 - a) \ln \frac{1 - a}{1 - b}.$$

Seeger (2002) and Langford (2005) have derived the following PAC-Bayesian theorem in which the trade-off
between the complexity and the risk is handled by $\mathrm{kl}(\cdot \| \cdot)$.

Theorem 3 (Seeger (2002); Langford (2005)). For any domain $P_S$ over $X \times Y$, any set of hypotheses
$\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a
probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over
$\mathcal{H}$, we have

$$\mathrm{kl}\big( R_S(G_\rho) \,\big\|\, R_{P_S}(G_\rho) \big) \le \frac{1}{m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right].$$
This version of the PAC-Bayes theorem offers a tight bound, especially for low empirical risk. However,
due to the $\mathrm{kl}\big(R_S(G_\rho) \,\|\, R_{P_S}(G_\rho)\big)$ term, this bound remains difficult
to interpret: the link between the empirical risk $R_S(G_\rho)$ and the “true” risk $R_{P_S}(G_\rho)$ is
not given in closed form. Thus, from an algorithmic point of view, finding the distribution $\rho$ that
minimizes the bound on $R_{P_S}(G_\rho)$ given by Theorem 3 might be a difficult task.

The following version of the PAC-Bayes theorem, which was the first proposed (McAllester, 1999),
appears easier to interpret since it links the terms $R_S(G_\rho)$ and $R_{P_S}(G_\rho)$ by a linear
relation. Note that Theorem 4 can be straightforwardly obtained from Theorem 3 using Pinsker’s inequality:

$$2 (q - p)^2 \le \mathrm{kl}(q \| p). \tag{7}$$

Theorem 4 (McAllester (1999)). For any domain $P_S$ over $X \times Y$, any set of hypotheses
$\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a
probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over
$\mathcal{H}$, we have

$$\big| R_{P_S}(G_\rho) - R_S(G_\rho) \big| \le \sqrt{ \frac{1}{2m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right] }.$$
Theorems 3 and 4 suggest that, in order to minimize the expected risk, a learning algorithm should
perform a trade-off between the empirical risk minimization $R_S(G_\rho)$ and KL-divergence minimization
$\mathrm{KL}(\rho \| \pi)$ (roughly speaking the complexity term).

The nature of this trade-off can be explicitly controlled in Theorem 5 below. This PAC-Bayesian
result, first proposed by Catoni (2007), is defined with a hyperparameter (here named $c$). It appears to
be a natural tool to design PAC-Bayesian algorithms. We present this result in the simplified form
suggested by Germain et al. (2009b).
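Since the bound of Theorem 3 is implicit in the true risk, it is usually evaluated numerically. The sketch below inverts the Bernoulli kl by bisection and compares the result with the explicit bound of Theorem 4; the function names and the numerical values (empirical risk, KL value, sample size, confidence) are illustrative assumptions.

```python
import math


def kl_bernoulli(a: float, b: float) -> float:
    """kl(a || b) between Bernoulli distributions, with the convention 0 ln 0 = 0."""
    tiny = 1e-12
    b = min(max(b, tiny), 1.0 - tiny)  # keep the logarithms finite
    total = 0.0
    if a > 0.0:
        total += a * math.log(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total


def complexity_term(kl_rho_pi: float, m: int, delta: float) -> float:
    """Right-hand side of Theorem 3: (KL(rho||pi) + ln(2 sqrt(m) / delta)) / m."""
    return (kl_rho_pi + math.log(2.0 * math.sqrt(m) / delta)) / m


def seeger_bound(emp_risk: float, kl_rho_pi: float, m: int, delta: float) -> float:
    """Upper bound on the true Gibbs risk: the largest b with kl(emp_risk || b) <= rhs."""
    rhs = complexity_term(kl_rho_pi, m, delta)
    low, high = emp_risk, 1.0
    for _ in range(100):  # kl(emp_risk || b) is increasing in b on [emp_risk, 1)
        middle = 0.5 * (low + high)
        if kl_bernoulli(emp_risk, middle) <= rhs:
            low = middle
        else:
            high = middle
    return high


def mcallester_bound(emp_risk: float, kl_rho_pi: float, m: int, delta: float) -> float:
    """Explicit upper bound of Theorem 4 on the true Gibbs risk."""
    return emp_risk + math.sqrt(0.5 * complexity_term(kl_rho_pi, m, delta))


# Illustrative values only.
emp_risk, kl_value, m, delta = 0.1, 5.0, 1000, 0.05
bound_seeger = seeger_bound(emp_risk, kl_value, m, delta)
bound_mcallester = mcallester_bound(emp_risk, kl_value, m, delta)
```

By Pinsker's inequality, the value returned by `seeger_bound` never exceeds the one returned by `mcallester_bound` for the same inputs, which is the sense in which Theorem 3 is the tighter statement.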

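Theorem 5 (Catoni (2007)). For any domain $P_S$ over $X \times Y$, any set of hypotheses $\mathcal{H}$,
any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, and any real number $c > 0$,
with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ on
$\mathcal{H}$, we have

$$R_{P_S}(G_\rho) \le \frac{c}{1 - e^{-c}} \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m \times c} \right].$$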

The bound given by Theorem 5 has two interesting characteristics. First, choosing $c = 1/\sqrt{m}$, the
bound becomes consistent: it converges to $1 \times [R_S(G_\rho) + 0]$ as $m$ grows. Second, as described
in Section 3.3, its minimization is closely related to the minimization problem associated with the SVM
when $\rho$ is an isotropic Gaussian over the space of linear classifiers (Germain et al., 2009a). Hence,
the value $c$ allows us to control the trade-off between the empirical risk $R_S(G_\rho)$ and the
complexity term $\frac{1}{m} \mathrm{KL}(\rho \| \pi)$.
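As a numerical illustration of Theorem 5, the following sketch evaluates the bound for growing sample sizes. The choice $c = 1/\sqrt{m}$ used here is one setting for which the multiplicative factor tends to 1 and the complexity term vanishes; the function name and all numerical values are assumptions made for the example.

```python
import math
from typing import List


def catoni_bound(emp_risk: float, kl_rho_pi: float, m: int,
                 delta: float, c: float) -> float:
    """Bound of Theorem 5 on the true Gibbs risk, for a trade-off parameter c > 0."""
    factor = c / (-math.expm1(-c))  # c / (1 - exp(-c)), computed stably for small c
    complexity = (kl_rho_pi + math.log(1.0 / delta)) / (m * c)
    return factor * (emp_risk + complexity)


# Illustrative values: fixed empirical risk and KL, growing sample size.
emp_risk, kl_value, delta = 0.1, 5.0, 0.05
sample_sizes: List[int] = [10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8]
bounds: List[float] = [
    catoni_bound(emp_risk, kl_value, m, delta, c=1.0 / math.sqrt(m))
    for m in sample_sizes
]
```

With these values the sequence `bounds` decreases towards the empirical risk as the sample size grows, while a fixed value of `c` would keep the multiplicative factor strictly above 1.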


3.3 Supervised PAC-Bayesian Learning of Linear Classifiers 

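Let us consider $\mathcal{H}$ as a set of linear classifiers in a $d$-dimensional space. Each
$h_{\mathbf{w}'} \in \mathcal{H}$ is defined by a weight vector $\mathbf{w}' \in \mathbb{R}^d$:

$$h_{\mathbf{w}'}(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}' \cdot \mathbf{x}),$$

where $\cdot$ denotes the dot product.

By restricting the prior and the posterior distributions over $\mathcal{H}$ to be Gaussian distributions,
Langford and Shawe-Taylor (2002); Ambroladze et al. (2006); and Parrado-Hernandez et al. (2012) have
specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier
$h_{\mathbf{w}} \in \mathcal{H}$. More precisely, given a prior $\pi_0$ and a posterior
$\rho_{\mathbf{w}}$ defined as spherical Gaussians with identity covariance matrix respectively centered
on vectors $\mathbf{0}$ and $\mathbf{w}$, for any $h_{\mathbf{w}'} \in \mathcal{H}$, we have

$$\pi_0(h_{\mathbf{w}'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|\mathbf{w}'\|^2 \right) \quad \text{and} \quad \rho_{\mathbf{w}}(h_{\mathbf{w}'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|\mathbf{w}' - \mathbf{w}\|^2 \right).$$

An interesting property of these Gaussian distributions is that the prediction of the
$\rho_{\mathbf{w}}$-weighted majority vote $B_{\rho_{\mathbf{w}}}(\cdot)$ coincides with the one of the
linear classifier $h_{\mathbf{w}}(\cdot)$. Indeed, we have

$$\forall \mathbf{x} \in X, \; \forall \mathbf{w} \in \mathbb{R}^d, \quad h_{\mathbf{w}}(\mathbf{x}) = B_{\rho_{\mathbf{w}}}(\mathbf{x}) = \mathrm{sign}\Big[ \mathop{\mathbb{E}}_{h_{\mathbf{w}'} \sim \rho_{\mathbf{w}}} h_{\mathbf{w}'}(\mathbf{x}) \Big].$$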

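Moreover, the expected risk of the Gibbs classifier $G_{\rho_{\mathbf{w}}}$ on a domain $P_S$ is then
given by

$$
\begin{aligned}
R_{P_S}(G_{\rho_{\mathbf{w}}})
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \; \mathop{\mathbb{E}}_{h_{\mathbf{w}'} \sim \rho_{\mathbf{w}}} \mathcal{L}_{0\text{-}1}\big(h_{\mathbf{w}'}(\mathbf{x}), y\big) \\
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \; \mathop{\mathbb{E}}_{h_{\mathbf{w}'} \sim \rho_{\mathbf{w}}} I\big(h_{\mathbf{w}'}(\mathbf{x}) \neq y\big) \\
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \; \mathop{\mathbb{E}}_{h_{\mathbf{w}'} \sim \rho_{\mathbf{w}}} I\big(y\, \mathbf{w}' \cdot \mathbf{x} \le 0\big) \\
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \int_{\mathbb{R}^d} \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|\mathbf{w}' - \mathbf{w}\|^2 \right) I\big(y\, \mathbf{w}' \cdot \mathbf{x} \le 0\big) \, d\mathbf{w}' \\
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \left[ 1 - \Pr_{t \sim \mathcal{N}(0, 1)} \left( t \le y \frac{\mathbf{w} \cdot \mathbf{x}}{\|\mathbf{x}\|} \right) \right] \\
&= \mathop{\mathbb{E}}_{(\mathbf{x}, y) \sim P_S} \Phi\left( y \frac{\mathbf{w} \cdot \mathbf{x}}{\|\mathbf{x}\|} \right),
\end{aligned}
$$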


Technical Report V 2 


Germain, Habrard, Laviolette, Morvant 


PAC-Bayesian Theorems for Domain Adaptation 


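where we defined

$$\Phi(a) = \frac{1}{2} \left[ 1 - \mathrm{Erf}\left( \frac{a}{\sqrt{2}} \right) \right],$$

with $\mathrm{Erf}(\cdot)$ the Gauss error function defined as

$$\mathrm{Erf}(b) = \frac{2}{\sqrt{\pi}} \int_0^b \exp(-t^2) \, dt.$$

Finally, the KL-divergence between $\rho_{\mathbf{w}}$ and $\pi_0$ becomes simply

$$\mathrm{KL}(\rho_{\mathbf{w}} \| \pi_0) = \tfrac{1}{2} \|\mathbf{w}\|^2. \tag{8}$$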
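The closed form of the Gibbs risk can be checked numerically. The sketch below computes the empirical counterpart of $\mathbb{E}\,\Phi(y\,\mathbf{w} \cdot \mathbf{x} / \|\mathbf{x}\|)$ on a toy sample and compares it with a Monte Carlo estimate obtained by drawing weight vectors from the Gaussian posterior centered on $\mathbf{w}$. The sample, the vector `w`, the number of draws and the function names are assumptions made for the illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Sequence[Tuple[Sequence[float], int]]


def phi(a: float) -> float:
    """Probit loss: Phi(a) = (1 - Erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def gibbs_risk_linear(w: Sequence[float], sample: Sample) -> float:
    """Average of Phi(y w.x / ||x||) over the sample (closed-form Gibbs risk)."""
    total = 0.0
    for x, y in sample:
        total += phi(y * dot(w, x) / math.sqrt(dot(x, x)))
    return total / len(sample)


def kl_to_prior(w: Sequence[float]) -> float:
    """KL divergence between the posterior centered on w and the prior centered on 0."""
    return 0.5 * dot(w, w)


def monte_carlo_gibbs_risk(w: Sequence[float], sample: Sample,
                           draws: int, seed: int = 0) -> float:
    """Estimate the Gibbs risk by sampling w' from N(w, I) and counting errors."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(draws):
        w_prime: List[float] = [wi + rng.gauss(0.0, 1.0) for wi in w]
        errors += sum(int(y * dot(w_prime, x) <= 0.0) for x, y in sample)
    return errors / (draws * len(sample))


# Toy values for illustration.
toy_sample: Sample = [((1.0, 2.0), 1), ((2.0, -1.0), 1), ((-1.0, 0.5), -1)]
w = [1.0, -0.5]
exact_risk = gibbs_risk_linear(w, toy_sample)
sampled_risk = monte_carlo_gibbs_risk(w, toy_sample, draws=20000)
```

The two values agree up to the sampling noise of the Monte Carlo estimate, and `kl_to_prior(w)` gives the complexity term of Equation (8).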


3.3.1 Objective Function and Gradient 

Based on the specialization of the PAC-Bayesian theory to linear classifiers, Germain et al. (2009a)
suggested minimizing a PAC-Bayesian bound on $R_{P_S}(G_{\rho_{\mathbf{w}}})$. For the sake of
completeness, we provide here more mathematical details than in the original conference paper (Germain et
al., 2009a). We will build on this PAC-Bayesian learning algorithm (for supervised learning) in our
domain adaptation work.

Given a sample $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ and a hyperparameter $C > 0$, the learning
algorithm performs a gradient descent in order to find an optimal weight vector $\mathbf{w}$ that
minimizes

$$F(\mathbf{w}) = C\, m\, R_S(G_{\rho_{\mathbf{w}}}) + \mathrm{KL}(\rho_{\mathbf{w}} \| \pi_0) = C \sum_{i=1}^{m} \Phi\left( y_i \frac{\mathbf{w} \cdot \mathbf{x}_i}{\|\mathbf{x}_i\|} \right) + \frac{1}{2} \|\mathbf{w}\|^2. \tag{9}$$

It turns out that the optimal vector $\mathbf{w}$ corresponds to the distribution that minimizes the
value of the bound on $R_{P_S}(G_{\rho_{\mathbf{w}}})$ given by Theorem 5, with the parameter $c$ of the
theorem being the hyperparameter $C$ of the learning algorithm. It is important to point out that
PAC-Bayesian theorems bound simultaneously $R_{P_S}(G_{\rho_{\mathbf{w}}})$ for every $\rho_{\mathbf{w}}$
on $\mathcal{H}$. Therefore, one can “freely” explore the domain of the objective function $F$ to choose a
posterior distribution $\rho_{\mathbf{w}}$ that gives, thanks to Theorem 5, a bound valid with
probability $1 - \delta$.

The minimization of Equation (9) by gradient descent corresponds to the learning algorithm called
PBGD3 of Germain et al. (2009a). The gradient of $F(\mathbf{w})$ is given by the vector
$\nabla F(\mathbf{w})$:

$$\nabla F(\mathbf{w}) = C \sum_{i=1}^{m} \Phi'\left( y_i \frac{\mathbf{w} \cdot \mathbf{x}_i}{\|\mathbf{x}_i\|} \right) \frac{y_i \mathbf{x}_i}{\|\mathbf{x}_i\|} + \mathbf{w},$$

where $\Phi'(a) = -\frac{1}{\sqrt{2\pi}} \exp\left( -\tfrac{1}{2} a^2 \right)$ is the derivative of
$\Phi(\cdot)$ at point $a$.

Similarly to the SVM, the learning algorithm PBGD3 realizes a trade-off between the empirical risk
(expressed by the loss $\Phi(\cdot)$) and the complexity of the learned linear classifier (expressed by
the regularizer $\|\mathbf{w}\|^2$). This similarity increases when we use a kernel function, as
described next.
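The objective of Equation (9) and its gradient translate directly into code. The sketch below is a plain fixed-step gradient descent written for illustration; it is not the original PBGD3 implementation, and the step size, the number of iterations, the toy sample and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Sample = Sequence[Tuple[Sequence[float], int]]


def phi(a: float) -> float:
    """Probit loss Phi(a) = (1 - Erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))


def phi_prime(a: float) -> float:
    """Derivative of Phi: -exp(-a^2 / 2) / sqrt(2 pi)."""
    return -math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def objective(w: Sequence[float], sample: Sample, C: float) -> float:
    """F(w) = C * sum_i Phi(y_i w.x_i / ||x_i||) + ||w||^2 / 2, as in Equation (9)."""
    loss = sum(phi(y * dot(w, x) / math.sqrt(dot(x, x))) for x, y in sample)
    return C * loss + 0.5 * dot(w, w)


def gradient(w: Sequence[float], sample: Sample, C: float) -> List[float]:
    """Gradient of F: C * sum_i Phi'(margin_i) * y_i x_i / ||x_i|| + w."""
    grad = list(w)
    for x, y in sample:
        norm = math.sqrt(dot(x, x))
        coefficient = C * phi_prime(y * dot(w, x) / norm) * y / norm
        for k, xk in enumerate(x):
            grad[k] += coefficient * xk
    return grad


def gradient_descent(sample: Sample, C: float, step: float = 0.05,
                     iterations: int = 200) -> List[float]:
    """Fixed-step descent from w = 0 (a simplification, without random restarts)."""
    w = [0.0] * len(sample[0][0])
    for _ in range(iterations):
        grad = gradient(w, sample, C)
        w = [wk - step * gk for wk, gk in zip(w, grad)]
    return w


# Toy linearly separable sample, for illustration only.
toy_sample: Sample = [((1.0, 2.0), 1), ((2.0, 1.0), 1),
                      ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]
w_learned = gradient_descent(toy_sample, C=5.0)
```

A finite-difference comparison of `objective` and `gradient` is a convenient way to validate the gradient formula before running the descent on real data.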


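3.3.2 Using a kernel function

The kernel trick allows us to substitute inner products by a kernel function
$k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ in Equation (9). If $k$ is a Mercer kernel, it
implicitly represents a function $\phi : X \to \mathbb{R}^{d'}$ that maps an example of $X$ into an
arbitrary $d'$-dimensional space⁶, such that

$$\forall (\mathbf{x}, \mathbf{x}') \in X^2, \quad k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x}) \cdot \phi(\mathbf{x}').$$

Then, a dual weight vector $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m) \in \mathbb{R}^m$
encodes the linear classifier $\mathbf{w} \in \mathbb{R}^{d'}$ as a linear combination of examples of $S$:

$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i\, \phi(\mathbf{x}_i), \quad \text{and thus} \quad h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sgn}\left[ \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) \right].$$

⁶ We consider here that the induced space is finite-dimensional.

By the representer theorem (Schölkopf et al., 2001), the vector $\mathbf{w}$ minimizing Equation (9) can
be recovered by finding the vector $\boldsymbol{\alpha}$ that minimizes

$$F(\boldsymbol{\alpha}) = C \sum_{i=1}^{m} \Phi\left( y_i \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K_{i,j}, \tag{10}$$

where $K$ is the kernel matrix of size $m \times m$. That is, $K_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j)$.
The gradient of $F(\boldsymbol{\alpha})$ is simply given by the vector
$\nabla F(\boldsymbol{\alpha}) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_m)$, with

$$\alpha'_{\#} = C \sum_{i=1}^{m} \Phi'\left( y_i \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i K_{i,\#}}{\sqrt{K_{i,i}}} + \sum_{i=1}^{m} \alpha_i K_{i,\#}, \quad \text{for } \# \in \{1, 2, \ldots, m\}.$$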
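The dual objective can be sanity-checked against the primal one: with a linear kernel, the two must coincide when the primal weight vector is the combination of the examples weighted by the dual vector. The sketch below performs this check on toy values; the data, the dual vector `alpha` and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def phi(a: float) -> float:
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))


def phi_prime(a: float) -> float:
    return -math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dual_margins(alpha: Sequence[float], K: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> List[float]:
    """margin_i = y_i * sum_j alpha_j K_ij / sqrt(K_ii)."""
    return [labels[i] * dot(alpha, K[i]) / math.sqrt(K[i][i])
            for i in range(len(alpha))]


def dual_objective(alpha: Sequence[float], K: Sequence[Sequence[float]],
                   labels: Sequence[int], C: float) -> float:
    """Kernelized objective: C * sum_i Phi(margin_i) + alpha^T K alpha / 2."""
    loss = sum(phi(a) for a in dual_margins(alpha, K, labels))
    regularizer = 0.5 * sum(alpha[i] * dot(alpha, K[i]) for i in range(len(alpha)))
    return C * loss + regularizer


def dual_gradient(alpha: Sequence[float], K: Sequence[Sequence[float]],
                  labels: Sequence[int], C: float) -> List[float]:
    """Partial derivatives of the kernelized objective with respect to each alpha_s."""
    m = len(alpha)
    margins = dual_margins(alpha, K, labels)
    grad = []
    for s in range(m):
        loss_part = sum(phi_prime(margins[i]) * labels[i] * K[i][s] / math.sqrt(K[i][i])
                        for i in range(m))
        grad.append(C * loss_part + dot(alpha, K[s]))
    return grad


def primal_objective(w: Sequence[float], points: Sequence[Sequence[float]],
                     labels: Sequence[int], C: float) -> float:
    """Primal objective of the linear case, used here only as a reference."""
    loss = sum(phi(y * dot(w, x) / math.sqrt(dot(x, x))) for x, y in zip(points, labels))
    return C * loss + 0.5 * dot(w, w)


# Toy data and a linear kernel, so that phi(x) = x and w = sum_i alpha_i x_i.
points = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.0)]
labels = [1, 1, -1]
alpha = [0.4, -0.1, -0.3]
K = [[dot(a, b) for b in points] for a in points]
w = [sum(alpha[i] * points[i][k] for i in range(3)) for k in range(2)]

value_dual = dual_objective(alpha, K, labels, C=2.0)
value_primal = primal_objective(w, points, labels, C=2.0)
```

Replacing the linear kernel by another Mercer kernel only changes how `K` is filled; the objective and gradient functions stay the same.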


3.3.3 Improving the Algorithm Using a Convex Objective 


An annoying drawback of PBGD3 is that the objective function is non-convex and the gradient descent
implementation needs many random restarts. In fact, we made extensive empirical experiments after the
ones described by Germain et al. (2009a) and saw that PBGD3 achieves an equivalent accuracy (and at a
fraction of the running time) by replacing the loss function $\Phi(\cdot)$ of Equations (9) and (10) by
its convex relaxation, which is

$$\Phi_{\mathrm{cvx}}(a) \overset{\text{def}}{=} \max\left\{ \Phi(a), \; \frac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\} = \begin{cases} \dfrac{1}{2} - \dfrac{a}{\sqrt{2\pi}} & \text{if } a \le 0, \\ \Phi(a) & \text{otherwise.} \end{cases}$$

The derivative of $\Phi_{\mathrm{cvx}}(\cdot)$ at point $a$ is then
$\Phi'_{\mathrm{cvx}}(a) = -\frac{1}{\sqrt{2\pi}}$ if $a < 0$, and $\Phi'(a)$ otherwise. Note that
Figure 1 in Section 5 illustrates the functions $\Phi(\cdot)$ and $\Phi_{\mathrm{cvx}}(\cdot)$.
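The convex relaxation replaces the concave part of the probit loss (negative margins) by its tangent at zero. The short sketch below implements the relaxed loss and its derivative; the function names are ours and the code is only meant as an illustration of the definition.

```python
import math


def phi(a: float) -> float:
    """Probit loss Phi(a) = (1 - Erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))


def phi_prime(a: float) -> float:
    """Derivative of Phi."""
    return -math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def phi_cvx(a: float) -> float:
    """Convex relaxation: maximum of Phi and of its tangent line at a = 0."""
    return max(phi(a), 0.5 - a / math.sqrt(2.0 * math.pi))


def phi_cvx_prime(a: float) -> float:
    """Derivative of the relaxation: constant slope for a < 0, Phi' otherwise."""
    if a < 0.0:
        return -1.0 / math.sqrt(2.0 * math.pi)
    return phi_prime(a)


# Values on a small grid of margins, e.g. for plotting both losses.
grid = [k / 4.0 for k in range(-12, 13)]
losses = [(a, phi(a), phi_cvx(a)) for a in grid]
```

Using `phi_cvx` and `phi_cvx_prime` in place of `phi` and `phi_prime` in an objective of the form of Equation (9) yields a convex problem, since the regularizer is convex as well.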


In the following, we present our contributions on PAC-Bayesian domain adaptation.


4 PAC-Bayesian Theorems for Domain Adaptation 

The originality of our contribution is to theoretically design a domain adaptation framework for the
PAC-Bayesian approach. In Section 4.1, we propose a domain comparison pseudometric suitable in this
context. We then derive PAC-Bayesian domain adaptation bounds in Section 4.2, which improve the result
proposed in Germain et al. (2013). Finally, in Section 5, we specialize our result to linear classifiers
and see that using the previous approach in a domain adaptation setting is a relevant strategy.

4.1 A Domain Divergence for PAC-Bayesian Analysis 

In the following, while the domain adaptation bounds presented in Section 2 focus on a single classifier,
we first define a $\rho$-average disagreement measure to compare the marginals. Then, this leads us to
derive our domain adaptation bound suitable for the PAC-Bayesian approach.

As discussed in Section 2.2, the derivation of generalization ability in domain adaptation critically
needs a divergence measure between the source and target marginals.

4.1.1 Designing the Divergence 

We define a domain disagreement pseudometric⁷ to measure the structural difference between domain
marginals in terms of posterior distribution $\rho$ over $\mathcal{H}$. Since we are interested in
learning a $\rho$-weighted majority vote $B_\rho$ leading to good generalization guarantees, we propose
to follow the idea behind the C-bound presented in Equation (4): given $P_S$, $P_T$, and $\rho$, if
$R_{P_S}(G_\rho)$ and $R_{P_T}(G_\rho)$ are similar, then $R_{P_S}(B_\rho)$ and $R_{P_T}(B_\rho)$ are
similar when $\mathop{\mathbb{E}}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and
$\mathop{\mathbb{E}}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ are also similar.

Thus, the domains $P_S$ and $P_T$ are close according to $\rho$ if the divergence between
$\mathop{\mathbb{E}}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and
$\mathop{\mathbb{E}}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ tends to be low. Our pseudometric is defined
as follows.

⁷ A pseudometric $d$ is a metric for which the property $d(x, y) = 0 \iff x = y$ is relaxed to $d(x, y) = 0 \Longleftarrow x = y$.
Definition 1. Let $\mathcal{H}$ be a hypothesis class. For any marginal distributions $D_S$ and $D_T$
over $X$, any distribution $\rho$ on $\mathcal{H}$, the domain disagreement
$\mathrm{dis}_\rho(D_S, D_T)$ between $D_S$ and $D_T$ is defined by

$$\mathrm{dis}_\rho(D_S, D_T) \overset{\text{def}}{=} \Big| \mathop{\mathbb{E}}_{(h, h') \sim \rho^2} \big[ R_{D_T}(h, h') - R_{D_S}(h, h') \big] \Big| = \big| R_{D_T}(G_\rho, G_\rho) - R_{D_S}(G_\rho, G_\rho) \big|.$$

Note that $\mathrm{dis}_\rho(\cdot, \cdot)$ is symmetric and fulfills the triangle inequality.

4.1.2 Comparison of the $\mathcal{H}\Delta\mathcal{H}$-divergence and our domain disagreement

While the $\mathcal{H}\Delta\mathcal{H}$-divergence of Theorem 1 is difficult to jointly optimize with
the empirical source error, our empirical disagreement measure is easier to manipulate: we simply need
to compute the $\rho$-average of the classifiers' disagreement instead of finding the pair of classifiers
that maximizes the disagreement. Indeed, $\mathrm{dis}_\rho(\cdot, \cdot)$ depends on the majority vote,
which suggests that we can directly minimize it via the empirical $\mathrm{dis}_\rho(S, T)$ and the
KL-divergence. This can be done without instance reweighting, change of representation space, or
modification of the family of classifiers. On the contrary,
$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ is a supremum over all
$h \in \mathcal{H}$ and hence does not depend on the $h$ on which the risk is considered. Moreover,
$\mathrm{dis}_\rho(\cdot, \cdot)$ (the $\rho$-average) is never larger than
$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ (the worst case). Indeed, for every
$\mathcal{H}$ and $\rho$ over $\mathcal{H}$, we have

$$
\begin{aligned}
\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)
&= \sup_{(h, h') \in \mathcal{H}^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \\
&\ge \mathop{\mathbb{E}}_{(h, h') \sim \rho^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \\
&\ge \mathrm{dis}_\rho(D_S, D_T).
\end{aligned}
$$


4.1.3 PAC-Bayesian bounds for our domain disagreement 

The following theorems show that $\mathrm{dis}_\rho(D_S, D_T)$ can be bounded in terms of the classical PAC-Bayesian
quantities: the empirical disagreement $\mathrm{dis}_\rho(S, T)$ estimated on the source and target samples, and the
KL-divergence between the prior and posterior distributions on $\mathcal{H}$.

For the sake of simplicity, let us first suppose that $m = m'$, i.e., the sizes of $S$ and $T$ are equal. Here is a
“Seeger’s type” PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$.

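Theorem 6. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, and any prior distribution
$\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for
every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{kl}\!\left( \frac{\mathrm{dis}_\rho(S,T) + 1}{2} \,\Big\|\, \frac{\mathrm{dis}_\rho(D_S,D_T) + 1}{2} \right) \;\leq\; \frac{1}{m} \left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2\sqrt{m}}{\delta} \right] .$$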

Proof. Deferred to Appendix B. □ 

Here is a “McAllester’s type” PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$ obtained
straightforwardly from Theorem 6.

Corollary 1. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, and any prior
distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$,
for every $\rho$ on $\mathcal{H}$, we have

$$\left| \mathrm{dis}_\rho(D_S,D_T) - \mathrm{dis}_\rho(S,T) \right| \;\leq\; 2 \times \sqrt{ \frac{1}{2m} \left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2\sqrt{m}}{\delta} \right] } .$$

Proof. The result is obtained by using Pinsker’s inequality (Equation (7)) on Theorem 6. □
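
To make the relation between the two bounds concrete, here is a minimal sketch that inverts the kl inequality of Theorem 6 by bisection and compares it with the closed form of Corollary 1. It assumes the complexity term $\frac{1}{m}[2\,\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{m}/\delta)]$ as read from the text above; the numerical inputs (empirical disagreement 0.1, KL value 5, $m = 2000$, $\delta = 0.05$) are hypothetical.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p), with 0 * ln(0) taken as 0."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def complexity(kl_post_prior, m, delta):
    """Assumed right-hand side of Theorem 6: (1/m) [2 KL + ln(2 sqrt(m) / delta)]."""
    return (2.0 * kl_post_prior + math.log(2.0 * math.sqrt(m) / delta)) / m

def seeger_upper(dis_emp, kl_post_prior, m, delta):
    """Largest d in [dis_emp, 1] with kl((dis_emp+1)/2 || (d+1)/2) <= complexity."""
    q, eps = (dis_emp + 1.0) / 2.0, complexity(kl_post_prior, m, delta)
    lo, hi = dis_emp, 1.0
    for _ in range(60):  # bisection: kl(q || .) increases to the right of q
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q, (mid + 1.0) / 2.0) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

def mcallester_upper(dis_emp, kl_post_prior, m, delta):
    """Corollary 1: dis_emp + 2 sqrt(complexity / 2), capped at 1."""
    return min(1.0, dis_emp + 2.0 * math.sqrt(complexity(kl_post_prior, m, delta) / 2.0))

seeger = seeger_upper(0.1, 5.0, 2000, 0.05)
mcallester = mcallester_upper(0.1, 5.0, 2000, 0.05)
```

Because Pinsker's inequality only loosens the kl constraint, the value obtained by inversion is never larger than the McAllester-type value.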

Here is a “Catoni’s type” PAC-Bayesian bound which helps us to derive a domain adaptation algorithm
in the following.

Theorem 7. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior distribution
$\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, and any real number $\alpha > 0$, with a probability at least $1-\delta$ over the choice of
$S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S,D_T) \;\leq\; \frac{2\alpha}{1 - e^{-2\alpha}} \left[ \mathrm{dis}_\rho(S,T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times \alpha} + 1 \right] - 1 .$$

Proof. Deferred to Appendix C. □

Similarly to the empirical risk bound of Catoni (2007) shown by Theorem 5, the above domain
disagreement bound is consistent if one puts $\alpha = \frac{1}{\sqrt{m}}$. Indeed, it converges to $1 \times [\mathrm{dis}_\rho(S,T) + 0 + 1] - 1$ as
$m$ grows.
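
The consistency argument can be checked numerically. The sketch below evaluates the Catoni-type bound for growing $m$ under two assumptions: the logarithmic term is read as $\ln(2/\delta)$, and $\alpha$ is set to $1/\sqrt{m}$. The values of the empirical disagreement, the KL term, and $\delta$ are hypothetical.

```python
import math

def catoni_disagreement_bound(dis_emp, kl_post_prior, m, delta, alpha):
    """Theorem 7 as read here: a' [dis_emp + (2 KL + ln(2/delta)) / (m alpha) + 1] - 1."""
    a_prime = 2.0 * alpha / (1.0 - math.exp(-2.0 * alpha))
    slack = (2.0 * kl_post_prior + math.log(2.0 / delta)) / (m * alpha)
    return a_prime * (dis_emp + slack + 1.0) - 1.0

# With alpha = 1 / sqrt(m), the bound shrinks toward the empirical value 0.1.
bounds = [catoni_disagreement_bound(0.1, 5.0, m, 0.05, 1.0 / math.sqrt(m))
          for m in (10**2, 10**4, 10**6, 10**8)]
```

Both the multiplicative factor $2\alpha/(1-e^{-2\alpha})$ and the additive slack tend to their limits (1 and 0) only when $\alpha \to 0$ while $m\alpha \to \infty$, which is what the choice $\alpha = 1/\sqrt{m}$ achieves.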

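The last result of this section tackles the situation where $m \neq m'$, i.e., the sizes of $S$ and $T$ are
different.

Theorem 8. For any marginal distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior
distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S \sim (D_S)^m$ and
$T \sim (D_T)^{m'}$, for every $\rho$ over $\mathcal{H}$, we have

$$\left| \mathrm{dis}_\rho(D_S, D_T) - \mathrm{dis}_\rho(S, T) \right| \;\leq\; \sqrt{ \frac{1}{2m} \left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right) } + \sqrt{ \frac{1}{2m'} \left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m'}}{\delta} \right) } .$$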

Proof. Deferred to Appendix D. □ 

Note that Theorem 8 is very similar to the result of Corollary 1. In fact, in the particular case $m = m'$,
Theorem 8 differs from Corollary 1 only by the $4\sqrt{m}$ term inside the logarithm, instead of $2\sqrt{m}$.


4.2 PAC-Bayesian Theorems for Domain Adaptation 

We now derive our main result in the following theorem: a domain adaptation bound relevant in a PAC- 
Bayesian setting. 


4.2.1 A domain adaptation bound for the stochastic Gibbs classifier 

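Theorem 9 below relies on the domain disagreement of Definition 1, and also on the expected joint error of
Equation (6).

Theorem 9. Let $\mathcal{H}$ be a hypothesis class. We have

$$\forall \rho \text{ on } \mathcal{H}, \quad R_{P_T}(G_\rho) \;\leq\; R_{P_S}(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(D_S,D_T) + \lambda_\rho ,$$

where $\lambda_\rho$ is the deviation between the expected joint errors of $G_\rho$ on the target and source domains:

$$\lambda_\rho \;\overset{\mathrm{def}}{=}\; \left| \mathbf{E}_{(h,h')\sim\rho^2} \left[ \mathbf{E}_{(x,y)\sim P_T} \mathcal{L}_{0\text{-}1}(h(x),y)\,\mathcal{L}_{0\text{-}1}(h'(x),y) - \mathbf{E}_{(x,y)\sim P_S} \mathcal{L}_{0\text{-}1}(h(x),y)\,\mathcal{L}_{0\text{-}1}(h'(x),y) \right] \right|
\;=\; \left| e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \right| . \qquad (11)$$

Proof. First, notice that for any distribution $P$ on $X \times Y$ (and corresponding marginal distribution $D$ on
$X$), we have

$$R_P(G_\rho) \;=\; \tfrac{1}{2} R_D(G_\rho,G_\rho) + e_P(G_\rho,G_\rho) , \qquad (12)$$

as

$$\begin{aligned}
2 R_P(G_\rho) &= \mathbf{E}_{(h,h')\sim\rho^2}\, \mathbf{E}_{(x,y)\sim P} \left[ \mathcal{L}_{0\text{-}1}(h(x),y) + \mathcal{L}_{0\text{-}1}(h'(x),y) \right] \\
&= \mathbf{E}_{(h,h')\sim\rho^2}\, \mathbf{E}_{(x,y)\sim P} \left[ 1 \times \mathcal{L}_{0\text{-}1}(h(x),h'(x)) + 2 \times \mathcal{L}_{0\text{-}1}(h(x),y)\,\mathcal{L}_{0\text{-}1}(h'(x),y) \right] \\
&= R_D(G_\rho,G_\rho) + 2 \times e_P(G_\rho,G_\rho) .
\end{aligned}$$

Therefore,

$$\begin{aligned}
R_{P_T}(G_\rho) - R_{P_S}(G_\rho) &= \tfrac{1}{2} \left( R_{D_T}(G_\rho,G_\rho) - R_{D_S}(G_\rho,G_\rho) \right) + \left( e_{P_T}(G_\rho,G_\rho) - e_{P_S}(G_\rho,G_\rho) \right) \\
&\leq \tfrac{1}{2} \left| R_{D_T}(G_\rho,G_\rho) - R_{D_S}(G_\rho,G_\rho) \right| + \left| e_{P_T}(G_\rho,G_\rho) - e_{P_S}(G_\rho,G_\rho) \right| \\
&= \tfrac{1}{2}\,\mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho .
\end{aligned}$$

□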
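
The decomposition of Equation (12), on which the proof rests, can be verified on a small example. The sketch below is only a sanity check on an empirical sample with hypothetical threshold voters and labels; it relies on the fact that, for binary labels, $I[h \neq y] + I[h' \neq y] = I[h \neq h'] + 2\, I[h \neq y]\, I[h' \neq y]$.

```python
from itertools import product

def gibbs_risk(voters, weights, labeled):
    """R_P(G_rho): rho-average 0-1 error on labeled pairs (x, y)."""
    return sum(w * sum(h(x) != y for x, y in labeled) / len(labeled)
               for h, w in zip(voters, weights))

def expected_disagreement(voters, weights, labeled):
    """R_D(G_rho, G_rho): rho^2-average of I[h(x) != h'(x)]."""
    return sum(wh * wg * sum(h(x) != g(x) for x, _ in labeled) / len(labeled)
               for (h, wh), (g, wg) in product(zip(voters, weights), repeat=2))

def expected_joint_error(voters, weights, labeled):
    """e_P(G_rho, G_rho): rho^2-average of I[h(x) != y] * I[h'(x) != y]."""
    return sum(wh * wg * sum((h(x) != y) and (g(x) != y) for x, y in labeled) / len(labeled)
               for (h, wh), (g, wg) in product(zip(voters, weights), repeat=2))

# Hypothetical voters, posterior weights, and labeled sample.
voters = [lambda x: 1 if x > 0.0 else -1,
          lambda x: 1 if x > 1.0 else -1,
          lambda x: 1 if x > 2.0 else -1]
weights = [0.5, 0.25, 0.25]
labeled = [(-1.0, -1), (0.5, 1), (1.5, -1), (3.0, 1)]

lhs = gibbs_risk(voters, weights, labeled)
rhs = (0.5 * expected_disagreement(voters, weights, labeled)
       + expected_joint_error(voters, weights, labeled))
```

On this sample both sides equal 0.3125, so the Gibbs risk splits exactly into half the expected disagreement plus the expected joint error.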

Our bound is, in general, incomparable with the ones of Theorems 1 and 2. It can be seen as a trade-off
between different quantities. The terms $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ are similar to the first two terms
of the domain adaptation bound of Ben-David et al. (2010a) (Equation (1)): $R_{P_S}(G_\rho)$ is the $\rho$-average
risk over $\mathcal{H}$ on the source domain, and $\mathrm{dis}_\rho(D_S, D_T)$ measures the $\rho$-average disagreement between the
marginals but is specific to the current $\rho$. The other term $\lambda_\rho$ measures the deviation between the expected
joint target and source errors of $G_\rho$. According to this theory, a good domain adaptation is possible if
this deviation is low. However, since we suppose that we do not have any label in the target sample, we
cannot control or estimate it. In practice, we suppose that $\lambda_\rho$ is low and we neglect it. In other words, we
assume that the labeling information between the two domains is related and that considering only the
marginal agreement and the source labels is sufficient to find a good majority vote. Another important
point is that this bound does not degenerate when the source and target distributions
are the same or close; see Section 8.2 for a discussion on this point.

In the next section, we provide three PAC-Bayesian theorems that justify the empirical optimization
of the bound of Theorem 9.


4.2.2 PAC-Bayesian theorems for domain adaptation 

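Finally, our Theorem 9 leads to a PAC-Bayesian bound based on both the empirical source error of the
Gibbs classifier and the empirical domain disagreement pseudometric estimated on source and target
samples.

From the preceding “Seeger’s type” results, one can then obtain the following PAC-Bayesian domain
adaptation bound.

Theorem 10. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any
set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0,1]$, with a probability at least $1-\delta$
over the choice of $S \times T \sim (P_S \times D_T)^m$, we have

$$R_{P_T}(G_\rho) \;\leq\; \sup \mathcal{R}_\rho + \tfrac{1}{2} \sup \mathcal{D}_\rho + \lambda_\rho ,$$

where $\lambda_\rho$ is defined by Equation (11), and

$$\mathcal{R}_\rho \;\overset{\mathrm{def}}{=}\; \left\{ r : \mathrm{kl}\left( R_S(G_\rho) \,\|\, r \right) \leq \frac{1}{m} \left[ \mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\} ,$$

$$\mathcal{D}_\rho \;\overset{\mathrm{def}}{=}\; \left\{ d : \mathrm{kl}\!\left( \frac{\mathrm{dis}_\rho(S,T) + 1}{2} \,\Big\|\, \frac{d + 1}{2} \right) \leq \frac{1}{m} \left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\} .$$

Proof. The result is obtained by inserting Theorems 3 and 6 (with $\delta := \frac{\delta}{2}$) in Theorem 9. □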

The following bound is based on Catoni’s approach and corresponds to the one from which we derive— 
in Section 5 —our algorithm for PAC-Bayesian domain adaptation. 

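Theorem 11. For any domains $P_S$ and $P_T$ (resp. with marginals $D_S$ and $D_T$) over $X \times Y$, any set of
hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, any real numbers $\alpha > 0$ and $c > 0$, with a
probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, for every posterior distribution $\rho$ on $\mathcal{H}$,
we have

$$R_{P_T}(G_\rho) \;\leq\; c'\, R_S(G_\rho) + \alpha'\, \tfrac{1}{2}\, \mathrm{dis}_\rho(S, T) + \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m} + \lambda_\rho + \tfrac{1}{2}(\alpha' - 1) ,$$

where $\lambda_\rho$ is defined by Equation (11), and where $c' \overset{\mathrm{def}}{=} \frac{c}{1 - e^{-c}}$, and $\alpha' \overset{\mathrm{def}}{=} \frac{2\alpha}{1 - e^{-2\alpha}}$.

Proof. In Theorem 9, we replace $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ by their upper bounds, obtained from Theorem 5
and Theorem 7, with $\delta$ chosen respectively as $\frac{\delta}{3}$ and $\frac{2\delta}{3}$. In the latter case, we use

$$2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{2\delta/3} \;=\; 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} \;<\; 2\left( \mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} \right) .$$

□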
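
The way the constants of Theorem 11 combine can be written out step by step. The derivation below is a sketch under one assumption not restated in this section: that Theorem 5 has the usual Catoni form $R_{P_S}(G_\rho) \leq c'\,[R_S(G_\rho) + (\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta})/(m\,c)]$. The Catoni-type disagreement bound is taken with the term $\ln\frac{2}{\delta}$.

$$\begin{aligned}
R_{P_S}(G_\rho) &\leq c' \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m\,c} \right] && \text{(Theorem 5 with } \delta/3\text{)} \\
\mathrm{dis}_\rho(D_S,D_T) &\leq \alpha' \left[ \mathrm{dis}_\rho(S,T) + \frac{2\left(\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}\right)}{m\,\alpha} + 1 \right] - 1 && \text{(Theorem 7 with } 2\delta/3\text{)} \\
R_{P_T}(G_\rho) &\leq R_{P_S}(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(D_S,D_T) + \lambda_\rho && \text{(Theorem 9)} \\
&\leq c' R_S(G_\rho) + \tfrac{1}{2}\alpha'\,\mathrm{dis}_\rho(S,T) + \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m} + \lambda_\rho + \tfrac{1}{2}(\alpha' - 1) .
\end{aligned}$$

The two events hold simultaneously with probability at least $1 - \frac{\delta}{3} - \frac{2\delta}{3} = 1 - \delta$ by the union bound, and the constant $\frac{1}{2}(\alpha' - 1)$ comes from $\frac{1}{2}(\alpha' \times 1 - 1)$.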

We now present a result based on the McAllester bound, which allows us to easily deal with different 
sizes of samples. 

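Theorem 12. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, and
for any set $\mathcal{H}$ of hypotheses, for any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least
$1-\delta$ over the choice of $S_1 \sim (P_S)^{m_1}$, $S_2 \sim (D_S)^{m_2}$, and $T \sim (D_T)^{m'}$, for every $\rho$ over $\mathcal{H}$, we have

$$R_{P_T}(G_\rho) \;\leq\; R_{S_1}(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(S_2, T) + \lambda_\rho + \sqrt{ \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m_1}}{\delta}}{2 m_1} } + \sqrt{ \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m_2}}{\delta}}{8 m_2} } + \sqrt{ \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m'}}{\delta}}{8 m'} } ,$$

where $\lambda_\rho$ is defined by Equation (11).

Proof. We insert Theorems 4 and 8 (with $\delta := \frac{\delta}{2}$) in Theorem 9. □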

Under the assumption that the domains are somehow related in terms of labeling agreement on $P_S$
and $P_T$ (for every distribution $\rho$ over $\mathcal{H}$), i.e., a low $\mathrm{dis}_\rho(D_S, D_T)$ implies a negligible $\lambda_\rho$, a natural
solution for a PAC-Bayesian domain adaptation algorithm without target labels is to minimize the bound
of Theorem 11 by disregarding $\lambda_\rho$. Notice that a major advantage of our domain adaptation bound is
that we can jointly optimize the risk and the divergence with a theoretical justification.

5 PAC-Bayesian Domain Adaptation Learning of Linear Classifiers

In this section, we design a learning algorithm for domain adaptation inspired by the PAC-Bayesian 
learning algorithm of Germain et al. (2009a). That is, we adopt the specialization of the PAC-Bayesian 
theory to linear classifiers described in Section 3.3. Note that the code of our algorithm is available 
on-line.® 

5.1 Minimizing the PAC-Bayesian Domain Adaptation Bound

Let us consider a prior $\pi_0$ and a posterior $\rho_w$ that are spherical Gaussian distributions over a space of
linear classifiers, exactly as defined in Section 3.3.

Given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$ and a target sample $T = \{x_i^t\}_{i=1}^m$, we focus on the
minimization of the bound given by Theorem 11. We work under the assumption that the term $\lambda_{\rho_w}$ of the
bound is negligible. Thus, the posterior distribution $\rho_w$ that minimizes the bound on $R_{P_T}(G_{\rho_w})$ is the
same that minimizes

$$C\, m\, R_S(G_{\rho_w}) + A\, m\, \mathrm{dis}_{\rho_w}(S,T) + \mathrm{KL}(\rho_w\|\pi_0) . \qquad (13)$$

The values $A > 0$ and $C > 0$ are hyperparameters of the algorithm. Note that the constants $\alpha$ and $c$ of
Theorem 11 can be recovered from any $A$ and $C$.

5.1.1 Domain Disagreement of Linear Classifiers 

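We know from Equation (9) how to compute the terms $R_S(G_{\rho_w})$ and $\mathrm{KL}(\rho_w\|\pi_0)$ of Equation (13). Let
us now derive the value of $\mathrm{dis}_{\rho_w}(S,T)$, i.e., the empirical domain disagreement between $S$ and $T$ of a
distribution $\rho_w$ over linear classifiers.

®See http://graal.ift.ulaval.ca/pbda.

First, for any marginal $D$, we obtain

$$\begin{aligned}
R_D(G_{\rho_w}, G_{\rho_w}) &= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, \mathcal{L}_{0\text{-}1}(h(x),h'(x)) \\
&= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, I[h(x) \neq h'(x)] \\
&= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2} \left( I[h(x) = 1]\, I[h'(x) = -1] + I[h(x) = -1]\, I[h'(x) = 1] \right) \\
&= 2\, \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, I[h(x) = 1]\, I[h'(x) = -1] \\
&= 2\, \mathbf{E}_{x\sim D}\, \mathbf{E}_{h\sim\rho_w} I[h(x) = 1]\; \mathbf{E}_{h'\sim\rho_w} I[h'(x) = -1] \\
&= 2\, \mathbf{E}_{x\sim D}\, \Phi\!\left( \frac{w \cdot x}{\|x\|} \right) \Phi\!\left( -\frac{w \cdot x}{\|x\|} \right) \\
&= \mathbf{E}_{x\sim D}\, \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x}{\|x\|} \right) ,
\end{aligned}$$

where

$$\Phi_{\mathrm{dis}}(a) \;\overset{\mathrm{def}}{=}\; 2\,\Phi(a)\,\Phi(-a) .$$

Thus,

$$\mathrm{dis}_{\rho_w}(S,T) \;=\; \left| R_S(G_{\rho_w}, G_{\rho_w}) - R_T(G_{\rho_w}, G_{\rho_w}) \right|
\;=\; \left| \frac{1}{m} \sum_{i=1}^{m} \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \frac{1}{m} \sum_{i=1}^{m} \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right| . \qquad (14)$$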
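
Equation (14) translates directly into a few lines of code. The sketch below assumes $\Phi(a) = \frac{1}{2}[1 - \mathrm{Erf}(a/\sqrt{2})]$, the probit loss used for linear classifiers in Section 3.3; the function names and the toy vectors are illustrative only, and the two samples may have different sizes since each average is normalized separately.

```python
import math

def phi(a):
    """Probit loss Phi(a) = (1/2) [1 - Erf(a / sqrt(2))]."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))

def phi_dis(a):
    """Phi_dis(a) = 2 Phi(a) Phi(-a): expected disagreement at normalized margin a."""
    return 2.0 * phi(a) * phi(-a)

def margin(w, x):
    """Normalized margin w . x / ||x||."""
    return sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(sum(xi * xi for xi in x))

def linear_domain_disagreement(w, source, target):
    """Equation (14) for unlabeled samples given as lists of feature vectors."""
    r_s = sum(phi_dis(margin(w, x)) for x in source) / len(source)
    r_t = sum(phi_dis(margin(w, x)) for x in target) / len(target)
    return abs(r_s - r_t)

# Hypothetical example: one source point aligned with w, one target point orthogonal to it.
value = linear_domain_disagreement([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
```

Since $\Phi_{\mathrm{dis}}$ is even and peaks at 0, points close to the decision boundary contribute most to the expected disagreement.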


5.1.2 Objective Function and Gradient 

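From the results of Sections 3.3.1 and 5.1.1, we obtain that Equation (13) equals

$$C \sum_{i=1}^{m} \Phi\!\left( y_i^s\, \frac{w \cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^{m} \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{1}{2}\|w\|^2 ,$$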

which is highly non-convex. To make the optimization problem more tractable, we replace the loss function
$\Phi(\cdot)$ by its convex relaxation $\Phi_{\mathrm{cvx}}(\cdot)$ (as in Section 3.3.3) and minimize the resulting cost function by
gradient descent. Even if this optimization task is still not convex ($\Phi_{\mathrm{dis}}(\cdot)$ is quasiconcave), our empirical
study shows no need to perform many restarts to find a suitable solution.*

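We name this domain adaptation algorithm PBDA. To sum up, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$,
a target sample $T = \{x_i^t\}_{i=1}^m$, and hyperparameters $A$ and $C$, the algorithm PBDA performs gradient
descent to minimize the following objective function:

$$G(w) \;=\; C \sum_{i=1}^{m} \Phi_{\mathrm{cvx}}\!\left( y_i^s\, \frac{w \cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^{m} \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{1}{2}\|w\|^2 , \qquad (15)$$

where

$$\Phi(a) \overset{\mathrm{def}}{=} \frac{1}{2}\left[ 1 - \mathrm{Erf}\!\left( \frac{a}{\sqrt{2}} \right) \right] , \qquad
\Phi_{\mathrm{cvx}}(a) = \max\left\{ \Phi(a),\; \frac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\} , \qquad
\Phi_{\mathrm{dis}}(a) = 2 \times \Phi(a) \times \Phi(-a) ,$$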

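with $\mathrm{Erf}(\cdot)$ the Gauss error function defined in Equation (8). Figure 1 illustrates these three functions.
The gradient $\nabla G(w)$ of Equation (15) is then given by

$$\nabla G(w) \;=\; C \sum_{i=1}^{m} \Phi'_{\mathrm{cvx}}\!\left( \frac{y_i^s\, w \cdot x_i^s}{\|x_i^s\|} \right) \frac{y_i^s\, x_i^s}{\|x_i^s\|}
+ s \times A \sum_{i=1}^{m} \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) \frac{x_i^s}{\|x_i^s\|} - \Phi'_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \frac{x_i^t}{\|x_i^t\|} \right] + w ,$$

where $\Phi'_{\mathrm{cvx}}(a)$ and $\Phi'_{\mathrm{dis}}(a)$ are respectively the derivatives of functions $\Phi_{\mathrm{cvx}}(\cdot)$ and $\Phi_{\mathrm{dis}}(\cdot)$ evaluated at
point $a$, and

$$s \;=\; \mathrm{sgn}\left( \sum_{i=1}^{m} \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right) .$$

Figure 1: Behavior of functions $\Phi(\cdot)$, $\Phi_{\mathrm{cvx}}(\cdot)$ and $\Phi_{\mathrm{dis}}(\cdot)$.

* We observe empirically that a good strategy is to first find the vector $w$ minimizing the convex problem of PBGD3
described in Section 3.3.3, and then use this $w$ as a starting point for the gradient descent of PBDA.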


We extend these equations to kernels in the following subsection. 
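
The objective of Equation (15) and its gradient can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the authors' released code: it takes $\Phi(a) = \frac{1}{2}[1 - \mathrm{Erf}(a/\sqrt{2})]$ and $\Phi_{\mathrm{cvx}}(a) = \max\{\Phi(a), \frac{1}{2} - a/\sqrt{2\pi}\}$, expects feature vectors already normalized to unit length, and uses hypothetical toy data and helper names.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def phi(a):
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))

def dphi(a):
    return -math.exp(-0.5 * a * a) / SQRT_2PI

def phi_cvx(a):
    return max(phi(a), 0.5 - a / SQRT_2PI)

def dphi_cvx(a):
    # Phi is convex for a >= 0; its tangent at 0 takes over for a < 0.
    return dphi(a) if a >= 0.0 else -1.0 / SQRT_2PI

def phi_dis(a):
    return 2.0 * phi(a) * phi(-a)

def dphi_dis(a):
    return 2.0 * dphi(a) * (phi(-a) - phi(a))

def unit(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pbda_objective(w, xs, ys, xt, C, A):
    """Equation (15); xs and xt hold unit-normalized vectors of equal count."""
    risk = sum(phi_cvx(y * dot(w, x)) for x, y in zip(xs, ys))
    gap = sum(phi_dis(dot(w, s)) - phi_dis(dot(w, t)) for s, t in zip(xs, xt))
    return C * risk + A * abs(gap) + 0.5 * dot(w, w)

def pbda_gradient(w, xs, ys, xt, C, A):
    gap = sum(phi_dis(dot(w, s)) - phi_dis(dot(w, t)) for s, t in zip(xs, xt))
    sign = (gap > 0) - (gap < 0)
    grad = list(w)  # gradient of the ||w||^2 / 2 term
    for x, y in zip(xs, ys):
        coef = C * dphi_cvx(y * dot(w, x)) * y
        grad = [g + coef * v for g, v in zip(grad, x)]
    for s, t in zip(xs, xt):
        cs, ct = sign * A * dphi_dis(dot(w, s)), sign * A * dphi_dis(dot(w, t))
        grad = [g + cs * a - ct * b for g, a, b in zip(grad, s, t)]
    return grad

# Hypothetical toy data: three labeled source points, three unlabeled target points.
xs = [unit(x) for x in ([1.0, 2.0], [2.0, -1.0], [-1.5, 0.5])]
ys = [1, -1, 1]
xt = [unit(x) for x in ([0.5, 1.0], [1.0, 1.0], [-2.0, 1.0])]
w = [0.3, -0.7]
grad = pbda_gradient(w, xs, ys, xt, C=1.0, A=1.0)
```

In practice one would hand `pbda_objective` and `pbda_gradient` to a quasi-Newton optimizer; a finite-difference comparison against the objective is a cheap way to validate the analytic gradient before doing so.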

5.1.3 Using a Kernel Function 

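The kernel trick allows us to work with a dual weight vector $\alpha \in \mathbb{R}^{2m}$ that is a linear classifier in an
augmented space. Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have

$$h_w(x) \;=\; \mathrm{sgn}\left[ \sum_{i=1}^{m} \alpha_i\, k(x_i^s, x) + \sum_{i=1}^{m} \alpha_{i+m}\, k(x_i^t, x) \right] .$$

Let us denote $K$ the kernel matrix of size $2m \times 2m$ such that $K_{i,j} \overset{\mathrm{def}}{=} k(x_i, x_j)$, where

$$x_\# \;=\; \begin{cases} x_\#^s & \text{if } \# \leq m , \\ x_{\#-m}^t & \text{otherwise.} \end{cases}$$

In that case, the objective function of Equation (15) is rewritten in terms of the vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{2m})$
as

$$G(\alpha) \;=\; C \sum_{i=1}^{m} \Phi_{\mathrm{cvx}}\!\left( y_i^s\, \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right)
+ A \left| \sum_{i=1}^{m} \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right|
+ \frac{1}{2} \sum_{i=1}^{2m} \sum_{j=1}^{2m} \alpha_i \alpha_j K_{i,j} .$$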

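The gradient of the latter equation is given by the vector $\nabla G(\alpha) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_{2m})$, with

$$\alpha'_\# \;=\; C \sum_{i=1}^{m} \Phi'_{\mathrm{cvx}}\!\left( \frac{y_i^s \sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i^s\, K_{i,\#}}{\sqrt{K_{i,i}}}
+ s \times A \sum_{i=1}^{m} \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{K_{i,\#}}{\sqrt{K_{i,i}}} - \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \frac{K_{i+m,\#}}{\sqrt{K_{i+m,i+m}}} \right]
+ \sum_{j=1}^{2m} \alpha_j K_{\#,j} ,$$

where

$$s \;=\; \mathrm{sgn}\left( \sum_{i=1}^{m} \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right) .$$

6 Experiments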

6.1 General Setup 

PBDA* has been evaluated on a toy problem and a sentiment dataset. For our experiments, we minimize
the objective function using a Broyden-Fletcher-Goldfarb-Shanno method (BFGS) implemented in the
scipy Python library†. PBDA has been compared with:

• SVM learned only from the source domain, i.e., without adaptation. We made use of the SVM-light 
library (Joachims, 1999). 

• PBGD3, presented in Section 3.3, and learned only from the source domain, i.e., without adaptation.

• DASVM of Bruzzone and Marconcini (2010), an iterative domain adaptation algorithm which tries to 
maximize iteratively a notion of margin on self-labeled target examples. We implemented DASVM 
with the LibSVM library (Chang and Lin, 2001). 

• CODA of Chen et al. (2011), a co-training domain adaptation algorithm, which looks iteratively for 
target features related to the training set. We used the implementation provided by the authors. 
Note that Chen et al. (2011) have shown best results on the dataset considered in our Section 6.4. 

Each parameter is selected with a grid search via a classical 5-fold cross-validation (CV) on the source
sample for PBGD3 and SVM, and via a 5-fold reverse/circular validation (RCV) on the source and the
(unlabeled) target samples for CODA, DASVM, and PBDA. We describe this latter point in the following
section. Note that for PBDA we search on a 20 x 20 parameter grid for a A between 0.01 and 10® and a 
parameter C between 1.0 and 10®, both on a logarithm scale. 

6.2 A Note about the Reverse Validation

A crucial question in domain adaptation is the validation of the hyperparameters. One solution is to follow 
the principle proposed by Zhong et al. (2010) which relies on the use of a reverse validation approach. 
This approach is based on a so-called reverse classifier evaluated on the source domain. We propose to 
follow it for tuning the parameters of PBDA, DASVM and CODA. Note that Bruzzone and Marconcini 
(2010) have proposed a similar method, called circular validation, in the context of DASVM. 

Concretely, in our setting, given $k$ folds on the source labeled sample ($S = S_1 \cup \ldots \cup S_k$), $k$ folds on
the unlabeled target sample ($T = T_1 \cup \ldots \cup T_k$) and a learning algorithm (parametrized by a fixed tuple
of hyperparameters), the reverse cross-validation risk on the $i$-th fold is computed as follows. Firstly, the
source set $S \setminus S_i$ is used as a labeled sample and the target set $T \setminus T_i$ is used as an unlabeled sample for
learning a classifier $h'$. Secondly, using the same algorithm, a reverse classifier $h^r$ is learned using the
self-labeled sample $\{(x, h'(x))\}_{x \in T \setminus T_i}$ as the source set and the unlabeled part of $S \setminus S_i$ as target sample.
Finally, the reverse classifier $h^r$ is evaluated on $S_i$. We summarize this principle in Figure 2. The process
is repeated $k$ times to obtain the reverse cross-validation risk averaged across all folds.
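
The reverse validation procedure described above can be sketched as a short function. The `fit(labeled, unlabeled)` callback is a hypothetical interface standing for any domain adaptation learner run with fixed hyperparameters, and `fit_midpoint` is a deliberately trivial stand-in learner used only to make the example runnable.

```python
def reverse_validation_risk(fit, source_folds, target_folds):
    """Average reverse-validation error over k folds.

    fit(labeled, unlabeled) must return a classifier h(x) -> label.
    source_folds: list of k lists of (x, y); target_folds: list of k lists of x.
    """
    k = len(source_folds)
    risks = []
    for i in range(k):
        held_out = source_folds[i]
        src = [p for j in range(k) if j != i for p in source_folds[j]]
        tgt = [x for j in range(k) if j != i for x in target_folds[j]]
        h = fit(src, tgt)                                # source -> target
        self_labeled = [(x, h(x)) for x in tgt]          # auto-label the target
        h_rev = fit(self_labeled, [x for x, _ in src])   # target -> source
        risks.append(sum(h_rev(x) != y for x, y in held_out) / len(held_out))
    return sum(risks) / k

def fit_midpoint(labeled, unlabeled):
    """Hypothetical stand-in learner: threshold halfway between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == -1]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x > thr else -1

source_folds = [[(-2.0, -1), (2.0, 1)], [(-1.0, -1), (1.0, 1)]]
target_folds = [[-1.5, 1.5], [-2.5, 2.5]]
risk = reverse_validation_risk(fit_midpoint, source_folds, target_folds)
```

Note that no target label is ever used: the only labeled data on which an error is measured is the held-out source fold.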

6.3 Toy Problem: Two Inter-Twinning Moons

The source domain considered here is the classical binary problem with two inter-twinning moons, each 
class corresponding to one moon (Figure 3). We then consider seven different target domains by rotating 
anticlockwise the source domain according to seven angles (from 10° to 90°). The higher the angle, the 
more difficult the problem becomes. For each domain, we generate 300 instances (150 of each class). More¬ 
over, to assess the generalization ability of our approach, we evaluate each algorithm on an independent 
test set of 1,000 target points (not provided to the algorithms). We make use of a Gaussian kernel for 
all the methods. Each domain adaptation problem is repeated ten times, and we report the average error 
rates on Table 1. Note that since CODA decomposes features for applying co-training, it is not appropriate 
here (we have only two features). 
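
For readers who wish to reproduce a similar setup, the sketch below generates a two-moons source sample and rotated target samples. The exact geometry, noise level, and random seeds are assumptions made for illustration (the text does not specify them); only the sample size of 150 points per class and the anticlockwise rotation follow the description above.

```python
import math
import random

def two_moons(n_per_class, noise=0.05, seed=0):
    """Sample two interleaved half-circles; returns a list of ((x1, x2), label)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        t = rng.uniform(0.0, math.pi)
        data.append(((math.cos(t) + rng.gauss(0, noise),
                      math.sin(t) + rng.gauss(0, noise)), 1))
        t = rng.uniform(0.0, math.pi)
        data.append(((1.0 - math.cos(t) + rng.gauss(0, noise),
                      0.5 - math.sin(t) + rng.gauss(0, noise)), -1))
    return data

def rotate(data, degrees):
    """Rotate every point anticlockwise around the origin by the given angle."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [((c * x - s * y, s * x + c * y), label) for (x, y), label in data]

source = two_moons(150)
# Three of the rotation angles used in the experiments.
targets = {angle: rotate(two_moons(150, seed=angle), angle) for angle in (10, 50, 90)}
```

In the unsupervised setting of the paper, the labels of the rotated samples would be withheld from the learner and kept only for evaluation.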

We remark that our PBDA provides the best performance except for 20° and 50°, indicating that
PBDA accurately tackles domain adaptation tasks. It shows a nice adaptation ability, especially for the
hardest problem, probably due to the fact that $\mathrm{dis}_\rho$ is tighter and seems to be a good regularizer in a
domain adaptation situation. The adaptation versus risk minimization trade-off suggested by Theorem 12
appears in Figure 3. Indeed, the plot illustrates that PBDA accepts a lower source accuracy to
maintain its performance on the target domain, at least when the source and the target domains are not
so different. Note, however, that for large angles, PBDA prefers to “focus” on the source accuracy. We
claim that this is a reasonable behavior for a domain adaptation algorithm.

* We made our code available at the following URL: http://graal.ift.ulaval.ca/pbda/

† Available at http://www.scipy.org/

Figure 2: The principle of the reverse/circular validation in our setting.

Table 1: Average error rate results for seven rotation angles.

Angle    PBGD3^CV    SVM^CV    DASVM^RCV    PBDA^RCV
10°      0           0         0            0
20°      0.088       0.104     0            0.094
30°      0.210       0.24      0.259        0.103
40°      0.273       0.312     0.284        0.225
50°      0.399       0.4       0.334        0.412
70°      0.776       0.764     0.747        0.626
90°      0.824       0.828     0.82         0.687

6.4 Sentiment Analysis Dataset 

We consider the popular Amazon reviews dataset (Blitzer et al., 2006) composed of reviews of four types
of Amazon.com® products (books, DVDs, electronics, kitchen appliances). Originally, the reviews
corresponded to a rating between one and five stars and the feature space (of unigrams and bigrams) has on
average a dimension of 100,000. For the sake of simplicity and to obtain a binary classification task,
we follow a setting similar to the one proposed by Chen et al. (2011). The two possible
classes are then: +1 for the products with a rating higher than 3 stars, −1 for those with a rating lower than or equal to
3 stars. The dimensionality is reduced in the following way: Chen et al. (2011) only kept the features that
appear at least ten times in a particular DA task (about 40,000 features remain), and pre-processed
the data with a standard tf-idf re-weighting. Each type of product is a domain, so we perform twelve
domain adaptation tasks. For example, “books→DVDs” corresponds to the task for which books is the
source domain and DVDs the target one. The algorithms use a linear kernel and consider 2,000 labeled
source examples and 2,000 unlabeled target examples. We evaluate them on separate target test sets
proposed by Chen et al. (2011) (between 3,000 and 6,000 examples), and we report the results in Table
2. We make the following observations.

First, as expected, the domain adaptation approaches provide the best average results. Then, PBDA
is on average better than CODA, but less accurate than DASVM. However, PBDA is competitive: the
results are not significantly different from CODA and DASVM. Moreover, we have observed that PBDA is
significantly faster than CODA and DASVM: these two algorithms are based on costly iterative procedures
increasing the running time by at least a factor of five in comparison with PBDA. In fact, the clear advantage
of PBDA is that we jointly optimize the terms of our bound in one step.

Figure 3: Illustration of the decision boundary of PBDA on three rotation angles for fixed parameters
A = C = 1. The two classes of the source sample are green and pink, and the target (unlabeled) sample
is gray. The bottom plot shows the corresponding source and target errors. We intentionally avoid tuning
PBDA parameters to highlight its inherent adaptation behavior.

6.5 Combining PBDA and Representation Learning 

As discussed in the introduction, there exist several families of approaches used to tackle the domain 
adaptation problem. The present work focuses on the minimization of a distance metric between the 
source and target distributions. Now, we ask ourselves whether it can be fruitful to combine our PBDA 
algorithm with another approach. To do so, we executed PBDA on top of the Marginalized Stacked 
Denoising Autoencoders (mSDA) introduced by Chen et al. (2012). 

In brief, mSDA is an unsupervised algorithm that learns a new representation of the training samples. 
As a “denoising autoencoders” algorithm, it finds a representation from which one can (approximately) 
reconstruct the original features of an example from its noisy counterpart. The originality of mSDA is to 
learn a representation that allows reconstructing both source and target unlabeled examples. Then, one 
can execute any supervised learning algorithm on the new representation of source samples, for which the 
labels are known. 

That is, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m}$ and a target sample $T = \{x_i^t\}_{i=1}^{m'}$, mSDA takes
the unlabeled parts of $S$ and $T$, $\{x_1^s, \ldots, x_m^s, x_1^t, \ldots, x_{m'}^t\}$, and learns a feature map $f : X \to X'$, where
$X'$ is a new input space (of real-valued vectors). In (Chen et al., 2012), a linear SVM is executed using
$S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^{m}$ as training data, and the hyperparameter $C$ is selected by standard cross-validation.

We compare the performance of SVM on the mSDA representation to PBDA on the same representation.
That is, we obtain a new representation of both source $S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^{m}$ and target $T_f = \{f(x_i^t)\}_{i=1}^{m'}$
data, using mSDA. Then, we execute PBDA using $S_f$ and $T_f$.

This comparison is done using the Amazon reviews dataset. For the sake of comparison, we used the
dataset pre-processed by Chen et al. (2012), which is slightly different from the one used in Section 6.4.
Indeed, each domain shares the same 5,000 features, and no tf-idf re-weighting is applied. For each
source-target pair, mSDA representations are generated using a corruption probability of 50% and a number of
layers of 5. Then, SVM and PBDA are executed on the same representations.

The results are reported in Table 3. The PBDA algorithm, when we select the hyperparameters by
reverse cross-validation (PBDA^RCV), is not always as good as the cross-validated SVM (SVM^CV). However,
by looking closer at the results, we notice that there often exist hyperparameters for which PBDA is better
on the testing set than the best achievable SVM (as reported by the columns PBDA^best and SVM^best).
This suggests that it might be advantageous to mix mSDA and PBDA learning strategies. However,
hyperparameter selection is still a challenge in domain adaptation when we do not have any target labels,
even if the reverse cross-validation method is a sound strategy. For exploratory purposes, we report in
Table 3 the risk of PBDA while performing the model selection by standard cross-validation (PBDA^CV) and
while we consider the mean of the cross-validation and the reverse cross-validation score (PBDA^CV+RCV).*
Interestingly, the latter method is a better selection criterion than taking one or the other validation risk
separately in this experiment, both being misleading in some situations.

Table 2: Error rates for sentiment analysis dataset. B, D, E, K respectively denote books, DVDs,
electronics, kitchen.

Task       PBGD3^CV    SVM^CV    DASVM^RCV    CODA^RCV    PBDA^RCV
B→D        0.174       0.179     0.193        0.181       0.183
B→E        0.275       0.290     0.226        0.232       0.263
B→K        0.236       0.251     0.179        0.215       0.229
D→B        0.192       0.203     0.202        0.217       0.197
D→E        0.256       0.269     0.186        0.214       0.241
D→K        0.211       0.232     0.183        0.181       0.186
E→B        0.268       0.287     0.305        0.275       0.232
E→D        0.245       0.267     0.214        0.239       0.221
E→K        0.127       0.129     0.149        0.134       0.141
K→B        0.255       0.267     0.259        0.247       0.247
K→D        0.244       0.253     0.198        0.238       0.233
K→E        0.235       0.149     0.157        0.153       0.129
Average    0.226       0.231     0.204        0.210       0.208

Table 3: Error rates for mSDA representations on sentiment analysis dataset.

Task       SVM^CV    PBDA^CV+RCV    PBDA^RCV    PBDA^CV    SVM^best    PBDA^best
B→D        0.172     0.174          0.181       0.174      0.171       0.170
B→E        0.243     0.235          0.235       0.308      0.221       0.179
B→K        0.189     0.181          0.181       0.185      0.158       0.158
D→B        0.179     0.178          0.178       0.189      0.174       0.175
D→E        0.223     0.233          0.233       0.327      0.195       0.165
D→K        0.152     0.155          0.155       0.163      0.152       0.147
E→B        0.239     0.246          0.246       0.251      0.226       0.233
E→D        0.233     0.232          0.230       0.232      0.225       0.230
E→K        0.128     0.123          0.123       0.133      0.127       0.115
K→B        0.229     0.230          0.230       0.225      0.221       0.217
K→D        0.209     0.216          0.311       0.208      0.209       0.200
K→E        0.138     0.134          0.142       0.134      0.138       0.133
Average    0.195     0.195          0.204       0.211      0.185       0.177


7 Generalization of the PAC-Bayesian Domain Adaptation Theorems to Multisource Domain Adaptation

In this section, we generalize our main analysis to multisource domain adaptation. 

7.1 Multisource Domain Adaptation Setting 

We now consider $n$ different source domains $\{P_{S_j}\}_{j=1}^{n}$ over $X \times Y$ (along with the associated
marginal distributions $\{D_{S_j}\}_{j=1}^{n}$ over $X$). In addition to the target $m'$-sample $T$ with $m'$ unlabeled examples drawn
i.i.d. from the target marginal $D_T$, we have one i.i.d. source learning sample $S_j$ per domain $P_{S_j}$ (possibly
of different sizes).

Similarly to Ben-David et al. (2010a), we study this issue when the relationship between the source
domains and the target one is captured by a distribution $v$ over the set of source domains $\{P_{S_j}\}_{j=1}^{n}$. This
distribution defines a mixture of source domains that we denote by $P_S^v$, and its marginal over $X$ by $D_S^v$,
and $S^v = \{S_j\}_{j=1}^{n}$ corresponds to the set of source samples. On the source domains, we then consider the
following $v$-weighted true error of the Gibbs classifier $G_\rho$:

$$R_{P_S^v}(G_\rho) \;\overset{\mathrm{def}}{=}\; \mathbf{E}_{P_{S_j}\sim v}\, R_{P_{S_j}}(G_\rho)
\;=\; \mathbf{E}_{P_{S_j}\sim v}\, \mathbf{E}_{h\sim\rho}\, R_{P_{S_j}}(h)
\;=\; \sum_{j=1}^{n} v(P_{S_j})\, R_{P_{S_j}}(G_\rho) .$$

Its empirical counterpart is defined as

$$R_{S^v}(G_\rho) \;=\; \sum_{j=1}^{n} v(P_{S_j})\, R_{S_j}(G_\rho) .$$

* It is important to point out that experiments on other datasets showed us that the CV+RCV method does not
systematically outperform the reverse cross-validation method alone.


Note that another solution for tackling multisource domain adaptation in a PAC-Bayesian philosophy
could be to learn different posterior distributions over $\mathcal{H}$ from different sources. Indeed, instead of learning
a shared $\rho$ on every domain (including the target one), we can learn a model for each domain, and then
try to learn a good target majority vote over this set of models. In this situation, one could derive a
PAC-Bayesian analysis similar to the one provided by Pentina and Lampert (2014) for life-long learning.
However, this setting does not appear suitable for extending our one-source domain analysis to
multiple sources, since they treat the prior distribution as a random variable, which is not our setting.


7.2 Generalization of the p-Disagreement to Multiple Sources 

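One natural solution to generalize the $\rho$-disagreement of Definition 1 to the multisource setting described
above is to make use of the $v$-weighted sum of each $\rho$-disagreement between a source distribution and
the target one, $\mathbf{E}_{D_{S_j}\sim v}\, \mathrm{dis}_\rho(D_{S_j}, D_T)$, for which we can easily extend Theorem 9. However, we prefer to
consider the following definition that is clearly tighter than the latter one.

Definition 2. Let $\mathcal{H}$ be a hypothesis class. For marginal distributions $\{D_{S_j}\}_{j=1}^{n}$ and $D_T$ over $X$, any
distribution $v$ on $\{D_{S_j}\}_{j=1}^{n}$, any distribution $\rho$ on $\mathcal{H}$, the domain disagreement $\mathrm{dis}_\rho(D_S^v, D_T)$ between the
mixture of source distributions $D_S^v$ and the target distribution $D_T$ is defined by

$$\mathrm{dis}_\rho(D_S^v, D_T) \;\overset{\mathrm{def}}{=}\; \left| \mathbf{E}_{(h,h')\sim\rho^2} \left[ R_{D_T}(h,h') - \mathbf{E}_{D_{S_j}\sim v}\, R_{D_{S_j}}(h,h') \right] \right|
\;=\; \left| R_{D_T}(G_\rho,G_\rho) - \mathbf{E}_{D_{S_j}\sim v}\, R_{D_{S_j}}(G_\rho,G_\rho) \right| .$$

As noticed before, we trivially have

$$\mathrm{dis}_\rho(D_S^v, D_T) \;\leq\; \mathbf{E}_{D_{S_j}\sim v}\, \mathrm{dis}_\rho(D_{S_j}, D_T) . \qquad (16)$$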
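
Inequality (16) is an instance of the triangle inequality, and the gap can be strict. The following sketch compares the two quantities on empirical samples; the voters, the mixture weights, and the samples are hypothetical and chosen so that the mixture disagreement vanishes while the averaged one does not.

```python
from itertools import product

def expected_disagreement(voters, weights, sample):
    """rho^2-average of I[h(x) != h'(x)] on an unlabeled sample."""
    return sum(wh * wg * sum(h(x) != g(x) for x in sample) / len(sample)
               for (h, wh), (g, wg) in product(zip(voters, weights), repeat=2))

def multisource_disagreement(voters, weights, sources, mixture, target):
    """Empirical Definition 2: |R_T(G, G) - sum_j v_j R_{S_j}(G, G)|."""
    mixed = sum(v * expected_disagreement(voters, weights, s)
                for v, s in zip(mixture, sources))
    return abs(expected_disagreement(voters, weights, target) - mixed)

def averaged_disagreement(voters, weights, sources, mixture, target):
    """Right-hand side of Equation (16): sum_j v_j |R_T(G, G) - R_{S_j}(G, G)|."""
    r_t = expected_disagreement(voters, weights, target)
    return sum(v * abs(r_t - expected_disagreement(voters, weights, s))
               for v, s in zip(mixture, sources))

voters = [lambda x: 1 if x > 0.0 else -1, lambda x: 1 if x > 1.0 else -1]
weights = [0.5, 0.5]
sources = [[0.5, 0.5, 2.0, 2.0], [-1.0, -1.0, 2.0, 2.0]]
mixture = [0.5, 0.5]
target = [0.5, 2.0, 2.0, 2.0]
lhs = multisource_disagreement(voters, weights, sources, mixture, target)
rhs = averaged_disagreement(voters, weights, sources, mixture, target)
```

Here one source disagrees more than the target and the other less, so their deviations cancel inside the absolute value of Definition 2 but add up in the averaged quantity.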


Therefore, one can use the various PAC-Bayesian bounds presented in Section 4.1.3 to obtain an
empirical guarantee over $\mathrm{dis}_\rho(D_S^v, D_T)$ from a collection of observations from each domain. In particular,
Corollary 2 below is directly obtained from Theorem 7.

For the sake of simplicity, the results presented for the multisource setting suppose that every sample
shares the same size $m$. We use the shortcut notation $S^v \sim (P_S^v)^m$ to denote the collection of $n$ source
samples of $m$ examples. That is, $S^v = \{S_j\}_{j=1}^{n}$ where $S_j \sim (P_{S_j})^m$.

Corollary 2. For any distributions $\{D_{S_j}\}_{j=1}^{n}$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any distribution
$v$ over $\{D_{S_j}\}_{j=1}^{n}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, and any real number $\alpha > 0$, with a
probability at least $1-\delta$ over the choice of $S^v \sim (D_S^v)^m$ and $T \sim (D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S^v, D_T) \;\leq\; \frac{2\alpha}{1 - e^{-2\alpha}} \left[ \mathbf{E}_{D_{S_j}\sim v}\, \mathrm{dis}_\rho(S_j, T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta} + \ln n}{m \times \alpha} + 1 \right] - 1 .$$

Proof. We upper bound the right-hand side of Equation (16) by upper-bounding each individual term of
the expectation using Theorem 7. That is, we bound

$$v(P_{S_1})\, \mathrm{dis}_\rho(S_1, T), \quad v(P_{S_2})\, \mathrm{dis}_\rho(S_2, T), \quad \ldots, \quad v(P_{S_n})\, \mathrm{dis}_\rho(S_n, T),$$

each one with probability $1 - \frac{\delta}{n}$. Thereafter, we regroup these $n$ bounds together to obtain the final result,
which holds with probability $1 - \delta$. □
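
The origin of the $\ln n$ term can be made explicit. The short derivation below is a sketch that assumes the Catoni-type bound for a single source carries the term $\ln\frac{2}{\delta}$ and writes $\alpha' = \frac{2\alpha}{1 - e^{-2\alpha}}$ and $v_j = v(P_{S_j})$. Applying that bound to each source with confidence $\delta/n$ replaces $\ln\frac{2}{\delta}$ by

$$\ln\frac{2}{\delta/n} \;=\; \ln\frac{2}{\delta} + \ln n ,$$

and, since $\sum_{j=1}^{n} v_j = 1$, the weighted sum of the $n$ bounds is

$$\sum_{j=1}^{n} v_j \left( \alpha' \left[ \mathrm{dis}_\rho(S_j, T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2n}{\delta}}{m \times \alpha} + 1 \right] - 1 \right)
\;=\; \alpha' \left[ \sum_{j=1}^{n} v_j\, \mathrm{dis}_\rho(S_j, T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta} + \ln n}{m \times \alpha} + 1 \right] - 1 .$$

By the union bound, the probability that at least one of the $n$ individual bounds fails is at most $n \times \frac{\delta}{n} = \delta$.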

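The bound given by Corollary 2 can suffer from the inequality of Equation (16). A better generalization
guarantee is given by Theorem 13 below that bounds directly $\mathrm{dis}_\rho(D_S^v, D_T)$, and does not rely on a term
“$\ln n$” like we have in Corollary 2.

Theorem 13. For any distributions $\{D_{S_j}\}_{j=1}^{n}$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any distribution
$v$ over $\{D_{S_j}\}_{j=1}^{n}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, and any real number $\alpha > 0$, with a
probability at least $1-\delta$ over the choice of $S^v \sim (D_S^v)^m$ and $T \sim (D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S^v, D_T) \;\leq\; \frac{2\alpha}{1 - e^{-2\alpha}} \left[ \mathrm{dis}_\rho(S^v, T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times \alpha} + 1 \right] - 1 .$$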

Proof. Deferred to Appendix E. □ 

Note that Theorem 6, Corollary 1 and Theorem 8 can also be rewritten to bound the multisource 
domain disagreement following the same proof techniques as we used for Theorem 13. 


7.3 Multisource Domain Adaptation Bound for the Stochastic Gibbs Classifier

Let now generalize the domain adaptation bound of Rpj.{Gp) presented by Theorem 9 to our multisource 
setting. 

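Theorem 14. Let $\mathcal{H}$ be a hypothesis class. We have

\[
\forall \rho \text{ on } \mathcal{H},\ \forall \nu \text{ on } \{P_{S_j}\}_{j=1}^n,\quad R_{P_T}(G_\rho) \;\le\; R_{P_S^\nu}(G_\rho) + \frac{1}{2}\,\mathrm{dis}_\rho(D_S^\nu, D_T) + \lambda_\rho^\nu ,
\]

where $\lambda_\rho^\nu$ is the deviation between the expected joint error of $G_\rho$ on the source domains and the target one:

\[
\begin{aligned}
\lambda_\rho^\nu \;&\overset{\mathrm{def}}{=}\; \left| \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \left[ \mathop{\mathbf{E}}_{(x,y)\sim P_T} \mathcal{L}_{0\text{-}1}(h(x),y)\,\mathcal{L}_{0\text{-}1}(h'(x),y) \;-\; \mathop{\mathbf{E}}_{P_{S_j}\sim\nu}\ \mathop{\mathbf{E}}_{(x,y)\sim P_{S_j}} \mathcal{L}_{0\text{-}1}(h(x),y)\,\mathcal{L}_{0\text{-}1}(h'(x),y) \right] \right| \\
&=\; \left| e_{P_T}(G_\rho,G_\rho) \;-\; \mathop{\mathbf{E}}_{P_{S_j}\sim\nu} e_{P_{S_j}}(G_\rho,G_\rho) \right| .
\end{aligned}
\tag{17}
\]

See Equation (6) for the definition of $e_{P_{S_j}}(G_\rho, G_\rho)$.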

Proof. We follow the same steps as in the proof of Theorem 9. Indeed, from Equation (12), we have 


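\[
\begin{aligned}
R_{P_T}(G_\rho) - R_{P_S^\nu}(G_\rho)
&= \frac{1}{2}\Big( R_{D_T}(G_\rho,G_\rho) - R_{D_S^\nu}(G_\rho,G_\rho) \Big) + \Big( e_{P_T}(G_\rho,G_\rho) - e_{P_S^\nu}(G_\rho,G_\rho) \Big) \\
&\le \frac{1}{2}\left| R_{D_T}(G_\rho,G_\rho) - \mathop{\mathbf{E}}_{D_{S_j}\sim\nu} R_{D_{S_j}}(G_\rho,G_\rho) \right| + \left| e_{P_T}(G_\rho,G_\rho) - \mathop{\mathbf{E}}_{P_{S_j}\sim\nu} e_{P_{S_j}}(G_\rho,G_\rho) \right| \\
&= \frac{1}{2}\,\mathrm{dis}_\rho(D_S^\nu, D_T) + \lambda_\rho^\nu .
\end{aligned}
\]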


□ 
7.4 PAC-Bayesian Theorem for Multisource Domain Adaptation

Building on Theorems 13 and 14, we now present a PAC-Bayesian theorem for multisource domain adaptation.


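Theorem 15. For any domains $\{P_{S_j}\}_{j=1}^n$ and $P_T$ (respectively with marginals $\{D_{S_j}\}_{j=1}^n$ and $D_T$) over
$X \times Y$, any distribution $\nu$ over $\{P_{S_j}\}_{j=1}^n$, and for any set $\mathcal{H}$ of hypotheses, for any prior distribution $\pi$
over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S^\nu \sim (P_S)^m$ and $T \sim (D_T)^m$,
for every $\rho$ over $\mathcal{H}$, we have

\[
R_{P_T}(G_\rho) \;\le\; c'\,R_{S^\nu}(G_\rho) + a'\,\frac{1}{2}\,\mathrm{dis}_\rho(S^\nu,T) + \left( \frac{c'}{c} + \frac{a'}{a} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m} + \lambda_\rho^\nu + \frac{1}{2}(a' - 1),
\]

where $\lambda_\rho^\nu$ is defined by Equation (17), and where, for real numbers $c > 0$ and $a > 0$,

\[
c' \;\overset{\mathrm{def}}{=}\; \frac{c}{1-e^{-c}} \qquad\text{and}\qquad a' \;\overset{\mathrm{def}}{=}\; \frac{2a}{1-e^{-2a}} .
\]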


Proof. In Theorem 14, replace $R_{P_S^\nu}(G_\rho)$ and $\mathrm{dis}_\rho(D_S^\nu, D_T)$ by their upper bounds, obtained from Theorem 5
and Theorem 13, with $\delta$ chosen respectively as $\frac{\delta}{3}$ and $\frac{2\delta}{3}$. □

Theorem 15 above is a generalization of Theorem 11. It is straightforward to generalize Theorems 10 
and 12 as well to the multisource setting. 
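As a numerical illustration, the sketch below evaluates a right-hand side of the same form as Theorem 11, which Theorem 15 generalizes, using the constants $c' = c/(1-e^{-c})$ and $a' = 2a/(1-e^{-2a})$. The function name and the input values are hypothetical. In particular, $\lambda_\rho^\nu$ cannot be estimated without target labels, so it is passed in here as a given number purely for illustration.

```python
import math

def multisource_pac_bayes_bound(emp_risk, emp_dis, kl, lam, m, c, a, delta):
    """Evaluate c' R + a' dis / 2 + (c'/c + a'/a)(KL + ln(3/delta))/m + lam + (a' - 1)/2.

    emp_risk: empirical source risk of the Gibbs classifier
    emp_dis:  empirical domain disagreement
    lam:      value assumed for the (non-estimable) lambda term
    """
    c_prime = c / (1.0 - math.exp(-c))
    a_prime = 2.0 * a / (1.0 - math.exp(-2.0 * a))
    complexity = (c_prime / c + a_prime / a) * (kl + math.log(3.0 / delta)) / m
    return (c_prime * emp_risk + a_prime * 0.5 * emp_dis
            + complexity + lam + 0.5 * (a_prime - 1.0))

# Hypothetical inputs, for illustration only.
value = multisource_pac_bayes_bound(
    emp_risk=0.20, emp_dis=0.10, kl=5.0, lam=0.0, m=5000, c=1.0, a=1.0, delta=0.05
)
print(f"bound value: {value:.4f}")
```

The constants $c'$ and $a'$ are both larger than 1 and tend to 1 as $c$ and $a$ tend to 0, at the price of a larger complexity term, which is the trade-off that $c$ and $a$ control.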


It is important to point out that the above theorem, which naturally generalizes our one-source domain
analysis, supposes that the distribution $\nu$ over $P_S$ is fixed (or known). However, we can prove generalization
bounds that involve $\nu$ given a prior distribution $\mu$ over $P_S$. On the one hand, it is possible to derive a
result for a fixed distribution $\rho$ on $\mathcal{H}$. On the other hand, such a result can also be derived for $\nu$ and $\rho$ at
the same time. These two results can be helpful to derive another kind of approach, and we detail and
discuss these bounds in Section 8.1.


7.5 PBDA for Multisource Domain Adaptation

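Regarding the results of Section 7, optimizing the PAC-Bayesian multisource domain adaptation bound
of Theorem 15 is equivalent to minimizing the following trade-off:

\[
C\,m\,R_{S^\nu}(G_{\rho_w}) + A\,m\,\mathrm{dis}_{\rho_w}(S^\nu,T) + \mathrm{KL}(\rho_w\|\pi_0),
\]

where

\[
\mathrm{dis}_{\rho_w}(S^\nu,T) = \left| R_{S^\nu}(G_{\rho_w}, G_{\rho_w}) - R_T(G_{\rho_w}, G_{\rho_w}) \right| ,
\]

and $S^\nu = \{S_j\}_{j=1}^n$, with $S_j = \{(x_i^{s_j}, y_i^{s_j})\}_{i=1}^m$, are the $n$ source samples coming from the mixture of source
domains $P_S$, and $T = \{x_i^t\}_{i=1}^m$ is the target sample. Given the vector of weights $\nu = \{\nu(P_{S_j})\}_{j=1}^n$ of
the source domains, finding the optimal $\rho_w$ is then equivalent to finding the vector $w$ that minimizes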


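\[
C \sum_{j=1}^{n}\sum_{i=1}^{m} \nu(P_{S_j})\,\Phi\!\left( y_i^{s_j}\,\frac{w\cdot x_i^{s_j}}{\|x_i^{s_j}\|} \right)
+ A \left| \sum_{i=1}^{m} \left[ \sum_{j=1}^{n} \nu(P_{S_j})\,\Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^{s_j}}{\|x_i^{s_j}\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^{t}}{\|x_i^{t}\|} \right) \right] \right|
+ \frac{\|w\|^2}{2} .
\]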


Note that if $\nu$ is a uniform distribution, i.e., every source domain is equally probable, one can solve the
above optimization problem using the learning algorithm PBDA of Section 5, with $S := \bigcup_{j=1}^{n} S_j$ as the
source sample. In Section 8.1, we discuss the possibility of creating other kinds of learning algorithms,
namely by learning $\nu$, the weights of the source distributions.
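The multisource trade-off can be sketched in a few lines of code. The block below is an illustration under stated assumptions, not the authors' implementation: it assumes the probit loss $\Phi(x) = \frac{1}{2}\left[1 - \mathrm{erf}(x/\sqrt{2})\right]$ and $\Phi_{\mathrm{dis}}(x) = 2\,\Phi(x)\,\Phi(-x)$ from the linear-classifier specialization of Section 5, an isotropic Gaussian prior so that the KL term equals $\|w\|^2/2$, and hypothetical function names and toy data.

```python
import math

def probit_loss(x):
    """Phi(x) = 1 - (standard normal CDF at x); assumed from Section 5."""
    return 0.5 * (1.0 - math.erf(x / math.sqrt(2.0)))

def phi_dis(x):
    """Phi_dis(x) = 2 Phi(x) Phi(-x); assumed from Section 5."""
    return 2.0 * probit_loss(x) * probit_loss(-x)

def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def _normalized_margin(w, x):
    return _dot(w, x) / math.sqrt(_dot(x, x))

def multisource_objective(w, sources, nu, target, C, A):
    """C * weighted source risk + A * |weighted source dis - target dis| + ||w||^2 / 2.

    sources: list of n samples, each a list of (x, y) with y in {-1, +1}
    nu:      list of n source weights
    target:  list of unlabeled target points x
    """
    risk = 0.0
    source_dis = 0.0
    for weight, sample in zip(nu, sources):
        for x, y in sample:
            margin = _normalized_margin(w, x)
            risk += weight * probit_loss(y * margin)
            source_dis += weight * phi_dis(margin)
    target_dis = sum(phi_dis(_normalized_margin(w, x)) for x in target)
    return C * risk + A * abs(source_dis - target_dis) + 0.5 * _dot(w, w)

# Toy data (hypothetical): two source samples and one target sample of size 3.
sources = [
    [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)],
    [([2.0, 1.0], 1), ([-1.0, 1.0], -1), ([1.0, -2.0], 1)],
]
target = [[1.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]
nu = [0.5, 0.5]
value_at_zero = multisource_objective([0.0, 0.0], sources, nu, target, C=2.0, A=1.0)
value_at_w = multisource_objective([1.0, -1.0], sources, nu, target, C=2.0, A=1.0)
```

At $w = 0$ every margin is zero, so the disagreement term cancels and the objective reduces to $C\,m/2$ (here 3.0). A minimizer of this non-convex objective could then be searched with any gradient-based routine started from several initial points.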


8 Discussions 

In this section, we discuss two points related to this paper. Firstly, we present two other results in
multisource domain adaptation that lead to open questions related to the derivation of new multisource
algorithms. Secondly, we point out the differences between our new version of the PAC-Bayesian domain
adaptation bound (Theorem 9) and the version proposed in Germain et al. (2013).

8.1 Other Results for Multiple Source Domain Adaptation

In Section 7, we studied multisource domain adaptation under the assumption that we know the distribution $\nu$
over $P_S$. However, this ideal situation does not always hold. Then either one can fix $\nu$ as the uniform
distribution, or one can learn $\nu$ given a prior distribution $\mu$ on $P_S$. This latter point can be justified by
the two following theorems.

Firstly, we can prove a bound similar to Theorem 15, but applied to the distribution $\nu$ on the source
domains instead of the distribution $\rho$ on $\mathcal{H}$.

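Theorem 16. For any domains $\{P_{S_j}\}_{j=1}^n$ and $P_T$ (respectively with marginals $\{D_{S_j}\}_{j=1}^n$ and $D_T$) over
$X \times Y$, any prior distribution $\mu$ over $\{P_{S_j}\}_{j=1}^n$, and for any set $\mathcal{H}$ of hypotheses, for any fixed distribution
$\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S^\nu \sim (P_S)^m$ and $T \sim (D_T)^m$,
for every $\nu$ over $\{P_{S_j}\}_{j=1}^n$, we have

\[
R_{P_T}(G_\pi) \;\le\; c'\,R_{S^\nu}(G_\pi) + a'\,\frac{1}{2}\,\mathrm{dis}_\pi(S^\nu,T) + \left( \frac{c'}{c} + \frac{a'}{a} \right) \frac{\mathrm{KL}(\nu\|\mu) + \ln\frac{3}{\delta}}{m} + \lambda_\pi^\nu + \frac{1}{2}(a' - 1),
\]

where $\lambda_\pi^\nu$ is defined by Equation (17), and where $c' = \frac{c}{1-e^{-c}}$ and $a' = \frac{2a}{1-e^{-2a}}$.

Proof. Deferred to Appendix F. □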


Secondly, it is possible to prove the same kind of generalization bounds for the distribution $\nu$ over the
source domains and the distribution $\rho$ over $\mathcal{H}$ at the same time. This result is stated in the next theorem.

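Theorem 17. For any domains $\{P_{S_j}\}_{j=1}^n$ and $P_T$ (respectively with marginals $\{D_{S_j}\}_{j=1}^n$ and $D_T$) over
$X \times Y$, any prior distribution $\mu$ over $\{P_{S_j}\}_{j=1}^n$, and for any set $\mathcal{H}$ of hypotheses, for any prior distribution
$\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S^\nu \sim (P_S)^m$ and $T \sim (D_T)^m$,
for every $\nu$ over $\{P_{S_j}\}_{j=1}^n$, and every $\rho$ over $\mathcal{H}$, we have

\[
R_{P_T}(G_\rho) \;\le\; c'\,R_{S^\nu}(G_\rho) + a'\,\frac{1}{2}\,\mathrm{dis}_\rho(S^\nu,T) + \left( \frac{c'}{c} + \frac{a'}{a} \right) \frac{\mathrm{KL}(\rho\|\pi) + \mathrm{KL}(\nu\|\mu) + \ln\frac{3}{\delta}}{m} + \lambda_\rho^\nu + \frac{1}{2}(a' - 1).
\]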




Proof. Deferred to Appendix G. □ 

These two theorems open the door to the design of two different algorithms for PAC-Bayesian
multisource domain adaptation when we desire to learn both the distributions $\nu$ on $P_S$ and $\rho$ on $\mathcal{H}$.
On the one hand, Theorem 16 suggests that one could derive a two-step algorithm for PAC-Bayesian
multisource domain adaptation, according to the following principle:

(i) Given a fixed distribution $\pi$ over $\mathcal{H}$, we can learn $\nu$ by minimizing a trade-off between $R_{S^\nu}(G_\pi)$,
$\mathrm{dis}_\pi(S^\nu,T)$ and $\mathrm{KL}(\nu\|\mu)$.

(ii) Then, for learning $\rho$, we simply have to optimize PBDA given this learned $\nu$.

On the other hand, Theorem 17 implies that we can jointly learn $\nu$ and $\rho$ by optimizing the trade-off
between $R_{S^\nu}(G_\rho)$, $\mathrm{dis}_\rho(S^\nu,T)$, $\mathrm{KL}(\nu\|\mu)$ and $\mathrm{KL}(\rho\|\pi)$. This leads to exciting research directions.

8.2 Comparison with the first PAC-Bayesian domain adaptation bound 

As said in Section 4, our PAC-Bayesian domain adaptation bound (of Theorem 9) improves the one
provided in Germain et al. (2013). We recall that our bound is expressed as follows. For every distribution
$\rho$ on $\mathcal{H}$, we have

\[
R_{P_T}(G_\rho) \;\le\; R_{P_S}(G_\rho) + \frac{1}{2}\,\mathrm{dis}_\rho(D_S, D_T) + \underbrace{\left| e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \right|}_{\lambda_\rho} .
\tag{18}
\]

Germain et al. (2013) proved the next result. For every distribution $\rho$ on $\mathcal{H}$, we have

\[
R_{P_T}(G_\rho) \;\le\; R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T) + \underbrace{R_{P_T}(G_{\rho_T^*}) + R_{D_T}(G_\rho, G_{\rho_T^*}) + R_{D_S}(G_\rho, G_{\rho_T^*})}_{\lambda_{\rho,\rho_T^*}} ,
\tag{19}
\]

where $\rho_T^* = \operatorname{argmin}_\rho R_{P_T}(G_\rho)$ is the best distribution on the target domain.

Footnote to Theorem 16: To avoid confusion with $\rho$, which we usually want to learn, we denote this fixed distribution $\pi$.

The improvement of Equation (18) over Equation (19) relies on two main points. On the one hand, our
new result contains only half of $\mathrm{dis}_\rho(D_S, D_T)$. On the other hand, contrary to $\lambda_{\rho,\rho_T^*}$ of Equation (19),
the term $\lambda_\rho$ of Equation (18) no longer depends on the best $\rho_T^*$ on the target domain. This implies
that our new bound does not degenerate when the two distributions $P_S$ and $P_T$ are equal (or very close).
Conversely, when $P_S = P_T$, the bound of Equation (19) gives

\[
R_{P_T}(G_\rho) \;\le\; R_{P_T}(G_\rho) + R_{P_T}(G_{\rho_T^*}) + 2\,R_{D_T}(G_\rho, G_{\rho_T^*}) ,
\]

which is at least $2\,R_{P_T}(G_{\rho_T^*})$. Moreover, the term $2\,R_{D_T}(G_\rho, G_{\rho_T^*})$ is greater than zero for any $\rho$ when the
supports of $\rho$ and $\rho_T^*$ over $\mathcal{H}$ include at least two different classifiers.

Finally, note that these improvements do not change the form and the philosophy of the PAC-Bayesian
theorems of Section 4.2.2, and hence of the algorithm PBDA of Section 5. Indeed, the only differences lie
in the factor $\frac{1}{2}$ in front of $\mathrm{dis}_\rho(D_S, D_T)$ and in the value of $\lambda_\rho$.

9 Conclusion and Future Work 

In this paper, we define a domain divergence pseudometric that is based on an average disagreement over 
a set of classifiers, along with consistency bounds for justifying its estimation from samples. This measure 
helps us to derive a first PAC-Bayesian bound for domain adaptation. Moreover, from this bound we 
design a well-founded and competitive algorithm (PBDA) that can jointly optimize the multiple trade-offs 
implied by the bound for linear classifiers. In addition, we generalize our analysis to multisource domain 
adaptation, allowing us to take into account information from different source domains according to their 
relations to the target one. 

We think that this PAC-Bayesian analysis opens the door to develop new domain adaptation methods 
by making use of the possibilities offered by the PAC-Bayesian theory, and gives rise to new interesting 
directions of research, among which the following ones. 

Firstly, the PAC-Bayesian approach allows one to deal with an a priori belief on what are the best
classifiers; in this paper we opted for a non-informative prior that is a Gaussian centered at the origin of
the linear classifier space. The question of finding a relevant prior in a domain adaptation situation is an
exciting direction which could also be exploited when a few target labels are available. Moreover, as
pointed out by Pentina and Lampert (2014), this notion of prior distribution could model information
learned from previous tasks. This suggests that we can extend our multisource analysis to issues related
to lifelong learning where the objective is to perform well on future tasks, for which so far no data has
been observed (Thrun and Mitchell, 1995).

Another promising direction is to address the problem of hyperparameter selection. Indeed, the
adaptation capability of our algorithm PBDA could be pushed even further with a specific PAC-Bayesian
validation procedure. An idea would be to propose a kind of (reverse) validation technique that takes
into account some particular prior distributions. Another possible solution could be to explicitly control
the neglected term in the domain adaptation bound. This is also linked with model selection for domain
adaptation tasks.

Besides, deriving a result similar to Equation (4) (the C-bound) for domain adaptation could be of high 
interest. Indeed, such an approach considers the first two moments of the margin of the weighted majority 
vote. This could help us to take into account both a kind of margin information over unlabeled data and 
the distribution disagreement (these two elements seem of crucial importance in domain adaptation). 
A Some Tools


Lemma 1 (Markov’s inequality). Let $Z$ be a random variable and $t > 0$, then

\[
P(|Z| \ge t) \;\le\; \mathbf{E}(|Z|)/t .
\]

Lemma 2 (Jensen’s inequality). Let $Z$ be an integrable real-valued random variable and $g(\cdot)$ any function.
If $g(\cdot)$ is convex, then

\[
g(\mathbf{E}[Z]) \;\le\; \mathbf{E}[g(Z)] .
\]

If $g(\cdot)$ is concave, then

\[
g(\mathbf{E}[Z]) \;\ge\; \mathbf{E}[g(Z)] .
\]


Lemma 3 (Maurer (2004)). Let $X = (X_1, \ldots, X_m)$ be a vector of i.i.d. random variables, $0 \le X_i \le 1$,
with $\mathbf{E}\,X_i = \mu$. Denote $X' = (X'_1, \ldots, X'_m)$, where $X'_i$ is the unique Bernoulli ($\{0,1\}$-valued) random
variable with $\mathbf{E}\,X'_i = \mu$. If $f : [0,1]^m \to \mathbb{R}$ is convex, then

\[
\mathbf{E}[f(X)] \;\le\; \mathbf{E}[f(X')] .
\]

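Lemma 4 (from Inequalities (1) and (2) of Maurer (2004)). Let $m \ge 8$, and $X = (X_1, \ldots, X_m)$ be a
vector of i.i.d. random variables, $0 \le X_i \le 1$. Then

\[
\sqrt{m} \;\le\; \mathbf{E} \exp\!\left( m\,\mathrm{kl}\!\left( \frac{1}{m}\sum_{i=1}^{m} X_i \,\Big\|\, \mathbf{E}[X_i] \right) \right) \;\le\; 2\sqrt{m} .
\]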
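Lemma 4 can be checked numerically in the Bernoulli case. A short calculation shows that, for $X_i$ Bernoulli with parameter $p \in (0,1)$, the expectation equals $\xi(m) = \sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1-k/m)^{m-k}$, which does not depend on $p$. The sketch below is only a sanity check under that Bernoulli assumption (the function names are ours); it computes $\xi(m)$ and compares it with $\sqrt{m}$ and $2\sqrt{m}$.

```python
import math

def xi(m):
    """Closed form of E exp(m * kl(k/m || p)) for k ~ Binomial(m, p), 0 < p < 1."""
    return sum(
        math.comb(m, k) * (k / m) ** k * (1.0 - k / m) ** (m - k)
        for k in range(m + 1)
    )

def kl(q, p):
    """Binary KL divergence kl(q || p), with the convention 0 * ln(0) = 0."""
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def expected_exp_mkl(m, p):
    """Direct evaluation of the same expectation, term by term."""
    return sum(
        math.comb(m, k) * p ** k * (1.0 - p) ** (m - k) * math.exp(m * kl(k / m, p))
        for k in range(m + 1)
    )

# The bounds of Lemma 4 hold on the whole tested range of m.
within_bounds = all(
    math.sqrt(m) <= xi(m) <= 2.0 * math.sqrt(m) for m in range(8, 201)
)
```

The direct evaluation `expected_exp_mkl(m, p)` agrees with `xi(m)` for any `p` strictly between 0 and 1, which is why the bound of Lemma 4 holds uniformly over the unknown risk.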


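Lemma 5 (Change of measure inequality). For any set $\mathcal{H}$, for any distributions $\pi$ and $\rho$ on $\mathcal{H}$, and for
any measurable function $\phi : \mathcal{H} \to \mathbb{R}$, we have

\[
\mathop{\mathbf{E}}_{f\sim\rho} \phi(f) \;\le\; \mathrm{KL}(\rho\|\pi) + \ln\!\left( \mathop{\mathbf{E}}_{f\sim\pi} e^{\phi(f)} \right) .
\]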


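Lemma 6. Given any set $\mathcal{H}$, and any distributions $\pi$ and $\rho$ on $\mathcal{H}$, let $\hat\rho$ and $\hat\pi$ be two distributions over $\mathcal{H}^2$
such that $\hat\rho(h, h') \overset{\mathrm{def}}{=} \rho(h)\rho(h')$ and $\hat\pi(h, h') \overset{\mathrm{def}}{=} \pi(h)\pi(h')$. Then

\[
\mathrm{KL}(\hat\rho\|\hat\pi) = 2\,\mathrm{KL}(\rho\|\pi).
\]

Proof.

\[
\begin{aligned}
\mathrm{KL}(\hat\rho\|\hat\pi) &= \mathop{\mathbf{E}}_{(h,h')\sim\hat\rho} \ln \frac{\rho(h)\rho(h')}{\pi(h)\pi(h')} \\
&= \mathop{\mathbf{E}}_{h\sim\rho} \ln \frac{\rho(h)}{\pi(h)} + \mathop{\mathbf{E}}_{h'\sim\rho} \ln \frac{\rho(h')}{\pi(h')} \\
&= 2 \mathop{\mathbf{E}}_{h\sim\rho} \ln \frac{\rho(h)}{\pi(h)} \\
&= 2\,\mathrm{KL}(\rho\|\pi).
\end{aligned}
\]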
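The identity of Lemma 6 is easy to verify on a small finite hypothesis set. The following sketch uses arbitrary, hypothetical distributions over three hypotheses and builds the two product distributions explicitly.

```python
import math
from itertools import product

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical posterior and prior over a set of three hypotheses.
rho = [0.7, 0.2, 0.1]
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

# Product distributions over pairs (h, h').
rho_hat = [r1 * r2 for r1, r2 in product(rho, repeat=2)]
prior_hat = [p1 * p2 for p1, p2 in product(prior, repeat=2)]

lhs = kl_div(rho_hat, prior_hat)
rhs = 2.0 * kl_div(rho, prior)
```

Both quantities coincide up to floating-point rounding, which is the factor 2 that appears in front of $\mathrm{KL}(\rho\|\pi)$ in the disagreement bounds.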
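B Proof of Theorem 6

Proof. Firstly, we propose to upper-bound

\[
d^{(1)} \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \Big[ R_{D_S}(h,h') - R_{D_T}(h,h') \Big]
\]

by its empirical counterpart

\[
d^{(1)}_{S\times T} \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \Big[ R_S(h,h') - R_T(h,h') \Big] .
\]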


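To achieve this, we consider an “abstract” classifier $\hat h \overset{\mathrm{def}}{=} (h, h') \in \mathcal{H}^2$ chosen according to a distribution $\hat\rho$,
with $\hat\rho(\hat h) = \rho(h)\rho(h')$. Let us define the “abstract” loss of $\hat h$ on a pair of examples $(x^s, x^t) \sim D_{S\times T} =
D_S \times D_T$ by

\[
\mathcal{L}_{d^{(1)}}(\hat h, x^s, x^t) \;\overset{\mathrm{def}}{=}\; \frac{1 + \mathcal{L}_{0\text{-}1}(h(x^s), h'(x^s)) - \mathcal{L}_{0\text{-}1}(h(x^t), h'(x^t))}{2} .
\]

Therefore, the “abstract” risk of $\hat h$ on the joint distribution is defined as

\[
R^{(1)}_{D_{S\times T}}(\hat h) \;=\; \mathop{\mathbf{E}}_{x^s\sim D_S}\ \mathop{\mathbf{E}}_{x^t\sim D_T} \mathcal{L}_{d^{(1)}}(\hat h, x^s, x^t),
\]

whose empirical counterpart is

\[
R^{(1)}_{S\times T}(\hat h) \;=\; \mathop{\mathbf{E}}_{(x^s,x^t)\sim S\times T} \mathcal{L}_{d^{(1)}}(\hat h, x^s, x^t).
\]

The errors of the related Gibbs classifier for these two quantities are

\[
R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \;=\; \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{D_{S\times T}}(\hat h)
\qquad\text{and}\qquad
R^{(1)}_{S\times T}(G_{\hat\rho}) \;=\; \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{S\times T}(\hat h) .
\tag{20}
\]

It is easy to show that

\[
d^{(1)} = 2\,R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) - 1
\qquad\text{and}\qquad
d^{(1)}_{S\times T} = 2\,R^{(1)}_{S\times T}(G_{\hat\rho}) - 1 .
\tag{21}
\]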


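Now, let us consider the non-negative random variable $\mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)}$.

We apply Markov’s inequality (Lemma 1). For every $\delta \in (0,1]$, with a probability at least $1-\delta$ over
the choice of $S \times T \sim (D_{S\times T})^m$, we have

\[
\begin{aligned}
\mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)}
&\le \frac{1}{\delta} \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m}\ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)} \\
&= \frac{1}{\delta} \mathop{\mathbf{E}}_{\hat h\sim\hat\pi}\ \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)} \\
&\le \frac{1}{\delta} \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} 2\sqrt{m},
\end{aligned}
\]

where the last inequality comes from Maurer’s lemma (Lemma 4).

By taking the logarithm of the outermost sides of the previous inequality, we then obtain

\[
\ln\!\left[ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)} \right] \;\le\; \ln \frac{2\sqrt{m}}{\delta} .
\]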


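Let us now find a lower bound of the left side of the last inequality by using the change of measure
inequality (Lemma 5) and Jensen’s inequality (Lemma 2) on the convex function $\mathrm{kl}(\cdot\|\cdot)$. We have

\[
\begin{aligned}
\ln\!\left[ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathrm{kl}\left( R^{(1)}_{S\times T}(\hat h) \,\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right)} \right]
&\ge \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} m\,\mathrm{kl}\!\left( R^{(1)}_{S\times T}(\hat h) \,\Big\|\, R^{(1)}_{D_{S\times T}}(\hat h) \right) - \mathrm{KL}(\hat\rho\|\hat\pi) \\
&\ge m\,\mathrm{kl}\!\left( \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{S\times T}(\hat h) \,\Big\|\, \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{D_{S\times T}}(\hat h) \right) - \mathrm{KL}(\hat\rho\|\hat\pi) \\
&= m\,\mathrm{kl}\!\left( R^{(1)}_{S\times T}(G_{\hat\rho}) \,\Big\|\, R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \right) - 2\,\mathrm{KL}(\rho\|\pi) .
\end{aligned}
\]

Note that the last equality is obtained from Equation (20) and Lemma 6.
We finally obtain

\[
\mathrm{kl}\!\left( R^{(1)}_{S\times T}(G_{\hat\rho}) \,\Big\|\, R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \right) \;\le\; \frac{1}{m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{2\sqrt{m}}{\delta} \right] .
\]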


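With Equation (21), the previous line gives us a bound on $d^{(1)}$ from its empirical counterpart $d^{(1)}_{S\times T}$.
Hence, with probability at least $1-\delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$,

\[
\mathrm{kl}\!\left( \frac{d^{(1)}_{S\times T} + 1}{2} \,\Big\|\, \frac{d^{(1)} + 1}{2} \right) \;\le\; \frac{1}{m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{2\sqrt{m}}{\delta} \right] .
\]

Lemma 7 (stated below) gives

\[
\mathrm{kl}\!\left( \frac{|d^{(1)}_{S\times T}| + 1}{2} \,\Big\|\, \frac{|d^{(1)}| + 1}{2} \right) \;\le\; \frac{1}{m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{2\sqrt{m}}{\delta} \right] ,
\]

which, since $|d^{(1)}| = \mathrm{dis}_\rho(D_S, D_T)$ and $|d^{(1)}_{S\times T}| = \mathrm{dis}_\rho(S, T)$, implies the result. □

Lemma 7. For $a, b \in [-1, +1]$, we have

\[
\mathrm{kl}\!\left( \frac{|a| + 1}{2} \,\Big\|\, \frac{|b| + 1}{2} \right) \;\le\; \mathrm{kl}\!\left( \frac{a + 1}{2} \,\Big\|\, \frac{b + 1}{2} \right) .
\]

Proof. There are four cases to consider.

Case 1: Let $a \ge 0$ and $b \ge 0$.

This first case is trivial, since $|a| = a$ and $|b| = b$.

Case 2: Let $a \le 0$ and $b \le 0$.

This case reduces to Case 1 because $\mathrm{kl}(q\|p) = \mathrm{kl}(1-q\|1-p)$ for all $(q,p) \in [0,1]^2$.
Then

\[
\mathrm{kl}\!\left( \frac{|a| + 1}{2} \,\Big\|\, \frac{|b| + 1}{2} \right) = \mathrm{kl}\!\left( \frac{-a + 1}{2} \,\Big\|\, \frac{-b + 1}{2} \right) = \mathrm{kl}\!\left( \frac{a + 1}{2} \,\Big\|\, \frac{b + 1}{2} \right) .
\]

Case 3: Let $a \le 0$ and $b \ge 0$.

From straightforward calculations, we show that

\[
\mathrm{kl}\!\left( \frac{-a + 1}{2} \,\Big\|\, \frac{b + 1}{2} \right) - \mathrm{kl}\!\left( \frac{a + 1}{2} \,\Big\|\, \frac{b + 1}{2} \right) \;\le\; 0 .
\]

Case 4: Let $a \ge 0$ and $b \le 0$.

This case reduces to Case 3, since $\mathrm{kl}(q\|p) = \mathrm{kl}(1-q\|1-p)$ for all $(q,p) \in [0,1]^2$.
Hence,

\[
\mathrm{kl}\!\left( \frac{|a| + 1}{2} \,\Big\|\, \frac{|b| + 1}{2} \right) = \mathrm{kl}\!\left( \frac{a + 1}{2} \,\Big\|\, \frac{-b + 1}{2} \right) \;\le\; \mathrm{kl}\!\left( \frac{-a + 1}{2} \,\Big\|\, \frac{-b + 1}{2} \right) = \mathrm{kl}\!\left( \frac{a + 1}{2} \,\Big\|\, \frac{b + 1}{2} \right) .
\]

□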
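The case analysis of Lemma 7 can be sanity-checked numerically. For $|b| < 1$, expanding the two binary KL divergences shows that the difference in Case 3 equals $a \ln\frac{1+b}{1-b}$, which is non-positive when $a \le 0 \le b$. The sketch below (our own helper names, on a grid that avoids the endpoints $\pm 1$) checks the inequality on a grid and this closed form at one point.

```python
import math

def kl(q, p):
    """Binary KL divergence kl(q || p) for 0 <= q <= 1 and 0 < p < 1."""
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def lemma7_gap(a, b):
    """Left-hand side minus right-hand side of Lemma 7 (should be <= 0)."""
    lhs = kl((abs(a) + 1.0) / 2.0, (abs(b) + 1.0) / 2.0)
    rhs = kl((a + 1.0) / 2.0, (b + 1.0) / 2.0)
    return lhs - rhs

grid = [i / 10.0 for i in range(-9, 10)]
max_gap = max(lemma7_gap(a, b) for a in grid for b in grid)

# Closed form of the gap in Case 3 (a <= 0 <= b), derived by expanding both kl terms.
a, b = -0.4, 0.6
case3_gap = lemma7_gap(a, b)
case3_closed_form = a * math.log((1.0 + b) / (1.0 - b))
```

The gap is zero (up to rounding) when $a$ and $b$ have the same sign and strictly negative otherwise, so taking absolute values can only tighten the kl term.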
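C Detailed Proof of Theorem 7

Proof. Similarly as in the proof of Theorem 6 (see Appendix B), we will first bound

\[
d^{(1)} \;=\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \Big[ R_{D_S}(h,h') - R_{D_T}(h,h') \Big]
\]

by its empirical counterpart.

Refer to the proof of Theorem 6 for the definitions of $R^{(1)}_{D_{S\times T}}(\hat h)$ and $R^{(1)}_{D_{S\times T}}(G_{\hat\rho})$, as well as their empirical
counterparts $R^{(1)}_{S\times T}(\hat h)$ and $R^{(1)}_{S\times T}(G_{\hat\rho})$.

As $\mathcal{L}_{d^{(1)}}$ lies in $[0,1]$, we can bound $d^{(1)}$ following the proof process of Theorem 5 (with $c = 2a$).

To do so, we define the convex function

\[
\mathcal{F}(x) \;\overset{\mathrm{def}}{=}\; -\ln\!\left[ 1 - (1 - e^{-2a})\,x \right] ,
\tag{22}
\]

and consider the non-negative random variable $\mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)}$.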


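We apply Markov’s inequality (Lemma 1). For every $\delta \in (0,1]$, with a probability at least $1 - \frac{\delta}{2}$ over
the choice of $S \times T \sim (D_{S\times T})^m$, we have

\[
\begin{aligned}
\mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)}
&\le \frac{2}{\delta} \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m}\ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)} \\
&= \frac{2}{\delta} \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h))} \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m} e^{-2ma\,R^{(1)}_{S\times T}(\hat h)} .
\end{aligned}
\]

By taking the logarithm on each side of the previous inequality, we obtain

\[
\ln\!\left[ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)} \right]
\;\le\; \ln\!\left[ \frac{2}{\delta} \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h))} \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m} e^{-2ma\,R^{(1)}_{S\times T}(\hat h)} \right] .
\tag{23}
\]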


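For a classifier $\hat h$, let us define a random variable $X_{\hat h}$ that follows a binomial distribution of $m$ trials
with a probability of success $R^{(1)}_{D_{S\times T}}(\hat h)$, denoted by $B(m, R^{(1)}_{D_{S\times T}}(\hat h))$. Lemma 3 gives

\[
\begin{aligned}
\mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m} e^{-2ma\,R^{(1)}_{S\times T}(\hat h)}
&\le \mathop{\mathbf{E}}_{X_{\hat h}\sim B(m,\,R^{(1)}_{D_{S\times T}}(\hat h))} e^{-2a\,X_{\hat h}} \\
&= \sum_{k=0}^{m}\ \Pr_{X_{\hat h}\sim B(m,\,R^{(1)}_{D_{S\times T}}(\hat h))}\!\left( X_{\hat h} = k \right) e^{-2ak} \\
&= \sum_{k=0}^{m} \binom{m}{k} \left( R^{(1)}_{D_{S\times T}}(\hat h) \right)^{k} \left( 1 - R^{(1)}_{D_{S\times T}}(\hat h) \right)^{m-k} e^{-2ak} \\
&= \sum_{k=0}^{m} \binom{m}{k} \left( R^{(1)}_{D_{S\times T}}(\hat h)\,e^{-2a} \right)^{k} \left( 1 - R^{(1)}_{D_{S\times T}}(\hat h) \right)^{m-k} \\
&= \left( R^{(1)}_{D_{S\times T}}(\hat h)\,e^{-2a} + \left( 1 - R^{(1)}_{D_{S\times T}}(\hat h) \right) \right)^{m} .
\end{aligned}
\]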
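The last line, together with the choice of $\mathcal{F}$ (Equation (22)), leads to

\[
\begin{aligned}
\mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h))} \mathop{\mathbf{E}}_{S\times T\sim(D_{S\times T})^m} e^{-2ma\,R^{(1)}_{S\times T}(\hat h)}
&\le \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\,\mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h))} \left( R^{(1)}_{D_{S\times T}}(\hat h)\,e^{-2a} + \left( 1 - R^{(1)}_{D_{S\times T}}(\hat h) \right) \right)^{m} \\
&= \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} 1 \;=\; 1 .
\end{aligned}
\]

We can now upper bound Equation (23) simply by

\[
\ln\!\left[ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)} \right] \;\le\; \ln \frac{2}{\delta} .
\]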
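The key cancellation in this step is that the moment generating function of a binomial variable, evaluated at $-2a$, is exactly $e^{-m\,\mathcal{F}(r)}$ for a success probability $r$. The sketch below checks this numerically for arbitrary, hypothetical values of $m$, $r$ and $a$; the function names are ours.

```python
import math

def F(x, a):
    """Convex function of Equation (22): F(x) = -ln(1 - (1 - exp(-2a)) x)."""
    return -math.log(1.0 - (1.0 - math.exp(-2.0 * a)) * x)

def binomial_mgf(m, r, a):
    """E exp(-2 a X) for X ~ Binomial(m, r), summed term by term."""
    return sum(
        math.comb(m, k) * r ** k * (1.0 - r) ** (m - k) * math.exp(-2.0 * a * k)
        for k in range(m + 1)
    )

# Hypothetical values, for illustration only.
m, r, a = 30, 0.35, 0.8
term_by_term = binomial_mgf(m, r, a)
closed_form = (r * math.exp(-2.0 * a) + (1.0 - r)) ** m
cancellation = math.exp(m * F(r, a)) * term_by_term
```

Here `cancellation` equals 1 up to rounding, which is why the expectation over the prior in the proof collapses to $\mathbf{E}\,1 = 1$.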
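Let us now find a lower bound of the left side of the last inequality by using the change of measure inequality
(Lemma 5) and Jensen’s inequality (Lemma 2) on the convex function $\mathcal{F}$.

\[
\begin{aligned}
\ln\!\left[ \mathop{\mathbf{E}}_{\hat h\sim\hat\pi} e^{\,m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right)} \right]
&\ge \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} m\left( \mathcal{F}(R^{(1)}_{D_{S\times T}}(\hat h)) - 2a\,R^{(1)}_{S\times T}(\hat h) \right) - \mathrm{KL}(\hat\rho\|\hat\pi) \\
&\ge m\,\mathcal{F}\!\left( \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{D_{S\times T}}(\hat h) \right) - 2am \mathop{\mathbf{E}}_{\hat h\sim\hat\rho} R^{(1)}_{S\times T}(\hat h) - \mathrm{KL}(\hat\rho\|\hat\pi) \\
&= m\,\mathcal{F}\!\left( R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \right) - 2ma\,R^{(1)}_{S\times T}(G_{\hat\rho}) - 2\,\mathrm{KL}(\rho\|\pi) .
\end{aligned}
\]

The last equality is obtained from Equation (20) and Lemma 6. This, in turn, implies

\[
\mathcal{F}\!\left( R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \right) \;\le\; 2a\,R^{(1)}_{S\times T}(G_{\hat\rho}) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m} .
\]

Now, by isolating $R^{(1)}_{D_{S\times T}}(G_{\hat\rho})$, we obtain

\[
R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \;\le\; \frac{1}{1 - e^{-2a}} \left( 1 - e^{-\left( 2a\,R^{(1)}_{S\times T}(G_{\hat\rho}) + \frac{1}{m}\left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta} \right) \right)} \right) ,
\]

and, from the inequality $1 - e^{-x} \le x$,

\[
R^{(1)}_{D_{S\times T}}(G_{\hat\rho}) \;\le\; \frac{1}{1 - e^{-2a}} \left( 2a\,R^{(1)}_{S\times T}(G_{\hat\rho}) + \frac{1}{m}\left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta} \right) \right) .
\]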


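It then follows from Equation (21) that, with probability at least $1 - \frac{\delta}{2}$ over the choice of $S \times T \sim
(D_S \times D_T)^m$, we have

\[
\frac{d^{(1)} + 1}{2} \;\le\; \frac{2a}{1 - e^{-2a}} \left[ \frac{d^{(1)}_{S\times T} + 1}{2} + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times 2a} \right] .
\tag{24}
\]

We now bound

\[
d^{(2)} \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \Big[ R_{D_T}(h,h') - R_{D_S}(h,h') \Big]
\]

using exactly the same argument as for $d^{(1)}$, except that we instead consider the following “abstract” loss
of $\hat h$ on a pair of examples $(x^s, x^t) \sim D_{S\times T} = D_S \times D_T$:

\[
\mathcal{L}_{d^{(2)}}(\hat h, x^s, x^t) \;\overset{\mathrm{def}}{=}\; \frac{1 + \mathcal{L}_{0\text{-}1}(h(x^t), h'(x^t)) - \mathcal{L}_{0\text{-}1}(h(x^s), h'(x^s))}{2} .
\]

We then obtain, with probability at least $1 - \frac{\delta}{2}$ over the choice of $S \times T \sim (D_S \times D_T)^m$,

\[
\frac{d^{(2)} + 1}{2} \;\le\; \frac{2a}{1 - e^{-2a}} \left[ \frac{d^{(2)}_{S\times T} + 1}{2} + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times 2a} \right] .
\tag{25}
\]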


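To finish the proof, note that by definition, we have that $d^{(1)} = -d^{(2)}$. Hence, we have

\[
|d^{(1)}| = |d^{(2)}| = \mathrm{dis}_\rho(D_S, D_T), \qquad\text{and}\qquad |d^{(1)}_{S\times T}| = |d^{(2)}_{S\times T}| = \mathrm{dis}_\rho(S, T).
\]

Then, the maximum of the bound on $d^{(1)}$ (Equation (24)) and the bound on $d^{(2)}$ (Equation (25)) gives a
bound on $\mathrm{dis}_\rho(D_S, D_T)$. By the union bound, with probability $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$,
we have

\[
\frac{|d^{(1)}| + 1}{2} \;\le\; \frac{a}{1 - e^{-2a}} \left[ |d^{(1)}_{S\times T}| + 1 + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times a} \right] ,
\]

or, equivalently,

\[
\mathrm{dis}_\rho(D_S, D_T) \;\le\; \frac{2a}{1 - e^{-2a}} \left[ \mathrm{dis}_\rho(S, T) + \frac{2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{\delta}}{m \times a} + 1 \right] - 1 ,
\]

and we are done. □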
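D Proof of Theorem 8

Proof. Let us consider the non-negative random variable $\mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2}$.

We apply Markov’s inequality (Lemma 1). For every $\delta \in (0,1]$, with a probability at least $1 - \frac{\delta}{2}$ over
the choice of $S \sim (D_S)^m$, we have

\[
\begin{aligned}
\mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2}
&\le \frac{2}{\delta} \mathop{\mathbf{E}}_{S\sim(D_S)^m}\ \mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2} \\
&= \frac{2}{\delta} \mathop{\mathbf{E}}_{(h,h')\sim\pi^2}\ \mathop{\mathbf{E}}_{S\sim(D_S)^m} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2} \\
&\le \frac{2}{\delta} \mathop{\mathbf{E}}_{(h,h')\sim\pi^2}\ \mathop{\mathbf{E}}_{S\sim(D_S)^m} e^{\,m\,\mathrm{kl}\left( R_S(h,h') \,\|\, R_{D_S}(h,h') \right)} && (26) \\
&\le \frac{2}{\delta} \mathop{\mathbf{E}}_{(h,h')\sim\pi^2} 2\sqrt{m} . && (27)
\end{aligned}
\]

Line (26) comes from Pinsker’s inequality, and Line (27) comes from Maurer’s lemma (Lemma 4). By
taking the logarithm on each outermost side of the previous inequality, we obtain

\[
\ln\!\left[ \mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2} \right] \;\le\; \ln \frac{4\sqrt{m}}{\delta} .
\tag{28}
\]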


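Let us now find a lower bound of the left side of the last inequality by using the change of measure
inequality (Lemma 5) and Jensen’s inequality (Lemma 2).

\[
\begin{aligned}
\ln\!\left[ \mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2} \right]
&\ge \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} 2m\left( R_{D_S}(h,h') - R_S(h,h') \right)^2 - \mathrm{KL}(\rho^2\|\pi^2) \\
&\ge 2m\left( \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} R_{D_S}(h,h') - \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} R_S(h,h') \right)^2 - \mathrm{KL}(\rho^2\|\pi^2) \\
&= 2m\left( R_{D_S}(G_\rho,G_\rho) - R_S(G_\rho,G_\rho) \right)^2 - 2\,\mathrm{KL}(\rho\|\pi) .
\end{aligned}
\]

The last equality is obtained from Equation (20) and Lemma 6. We finally obtain

\[
2m\left( R_{D_S}(G_\rho,G_\rho) - R_S(G_\rho,G_\rho) \right)^2 \;\le\; 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{4\sqrt{m}}{\delta} ,
\]

and we conclude, with a probability at least $1 - \frac{\delta}{2}$ over the choice of $S \sim (D_S)^m$,

\[
\left| R_{D_S}(G_\rho, G_\rho) - R_S(G_\rho, G_\rho) \right| \;\le\; \sqrt{ \frac{1}{2m} \left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{4\sqrt{m}}{\delta} \right] } .
\tag{29}
\]

Following the exact same proof process with the random variable $\mathop{\mathbf{E}}_{(h,h')\sim\pi^2} e^{\,2m\left( R_{D_T}(h,h') - R_T(h,h') \right)^2}$, we obtain,
with a probability at least $1 - \frac{\delta}{2}$ over the choice of $T \sim (D_T)^m$,

\[
\left| R_{D_T}(G_\rho, G_\rho) - R_T(G_\rho, G_\rho) \right| \;\le\; \sqrt{ \frac{1}{2m} \left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln \frac{4\sqrt{m}}{\delta} \right] } .
\tag{30}
\]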


Joining Inequalities (29) and (30) with the union bound (which ensures that both results hold simultaneously
with probability $1 - \delta$) gives the result because

\[
\left| R_{D_S}(G_\rho, G_\rho) - R_{D_T}(G_\rho, G_\rho) \right| = \mathrm{dis}_\rho(D_S, D_T),
\]

\[
\left| R_S(G_\rho, G_\rho) - R_T(G_\rho, G_\rho) \right| = \mathrm{dis}_\rho(S, T),
\]

and because if $|a_1 - b_1| \le c_1$ and $|a_2 - b_2| \le c_2$, then $|(a_1 - a_2) - (b_1 - b_2)| \le c_1 + c_2$. □
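Two elementary facts carry this proof: Pinsker’s inequality in the form $2(q-p)^2 \le \mathrm{kl}(q\|p)$, used at Line (26), and the final inequality on differences. The sketch below checks both on small grids of values; it is a numerical sanity check with helper names of our own choosing, not a proof.

```python
import math
from itertools import product

def kl(q, p):
    """Binary KL divergence kl(q || p) for 0 <= q <= 1 and 0 < p < 1."""
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

# Pinsker's inequality on a grid (a small tolerance absorbs rounding at q == p).
qs = [i / 20.0 for i in range(21)]
ps = [i / 20.0 for i in range(1, 20)]
pinsker_holds = all(
    2.0 * (q - p) ** 2 <= kl(q, p) + 1e-12 for q in qs for p in ps
)

def slack(a1, b1, a2, b2):
    """(|a1 - b1| + |a2 - b2|) - |(a1 - a2) - (b1 - b2)|, which should be >= 0."""
    return abs(a1 - b1) + abs(a2 - b2) - abs((a1 - a2) - (b1 - b2))

values = [0.0, 0.25, 0.5, 0.75, 1.0]
difference_bound_holds = all(
    slack(a1, b1, a2, b2) >= -1e-12 for a1, b1, a2, b2 in product(values, repeat=4)
)
```

The second check is just the triangle inequality applied to $(a_1 - b_1) - (a_2 - b_2)$, which is how the two deviation bounds (29) and (30) add up in the final result.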
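E Proof of Theorem 13

Proof. The proof follows all the steps of the proof of Theorem 7 (see Appendix C). The only difference is
that, in order to obtain a guarantee over $\mathrm{dis}_\rho(D_S^\nu, D_T)$, we bound

\[
d^{(1)} \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \left[ \mathop{\mathbf{E}}_{D_{S_j}\sim\nu} R_{D_{S_j}}(h,h') - R_{D_T}(h,h') \right]
\]

by its empirical counterpart

\[
d^{(1)}_{S^\nu\times T} \;\overset{\mathrm{def}}{=}\; \mathop{\mathbf{E}}_{(h,h')\sim\rho^2} \left[ \mathop{\mathbf{E}}_{D_{S_j}\sim\nu} R_{S_j}(h,h') - R_T(h,h') \right] .
\]

To do so, we define the “abstract” loss of $\hat h \overset{\mathrm{def}}{=} (h, h') \in \mathcal{H}^2$ on a tuple of $n+1$ examples $(x^{s_1}, \ldots, x^{s_n}, x^t) \sim
D_{S_1} \times \ldots \times D_{S_n} \times D_T$ by

\[
\mathcal{L}_{d^{(1)}}(\hat h, x^{s_1}, \ldots, x^{s_n}, x^t) \;\overset{\mathrm{def}}{=}\; \frac{1}{2} \left[ 1 + \mathop{\mathbf{E}}_{D_{S_j}\sim\nu} \mathcal{L}_{0\text{-}1}(h(x^{s_j}), h'(x^{s_j})) - \mathcal{L}_{0\text{-}1}(h(x^t), h'(x^t)) \right] .
\]

Again, we obtain the result by following the proof of Theorem 7. □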


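F Proof of Theorem 16

We first need the following result.

Theorem 18. For any distributions $\{D_{S_j}\}_{j=1}^n$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, for any prior
distribution $\mu$ over $\{D_{S_j}\}_{j=1}^n$, any distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0,1]$, and any real number $a > 0$, with
a probability at least $1-\delta$ over the choice of $S^\nu \sim (D_S)^m$ and $T \sim (D_T)^m$, for every distribution $\nu$ over
$\{D_{S_j}\}_{j=1}^n$, we have

\[
\mathrm{dis}_\pi(D_S^\nu, D_T) \;\le\; \frac{2a}{1-e^{-2a}} \left[ \mathrm{dis}_\pi(S^\nu, T) + \frac{\mathrm{KL}(\nu\|\mu) + \ln\frac{2}{\delta}}{n \times a} + 1 \right] - 1 .
\]

Proof. The proof follows a process similar to the proof of Theorem 7 in Appendix C: we separately bound

\[
R_{D_T}(G_\pi, G_\pi) - \mathop{\mathbf{E}}_{D_{S_j}\sim\nu} R_{D_{S_j}}(G_\pi, G_\pi)
\]

and

\[
\mathop{\mathbf{E}}_{D_{S_j}\sim\nu} R_{D_{S_j}}(G_\pi, G_\pi) - R_{D_T}(G_\pi, G_\pi),
\]

by rescaling their values into $[0,1]$.

Then, we easily obtain the result of Theorem 18. □

Proof of Theorem 16. In Theorem 14, replace $R_{P_S^\nu}(G_\rho)$ and $\mathrm{dis}_\rho(D_S^\nu, D_T)$ by their upper bounds,
obtained from Theorem 5 applied on $R_{P_S^\nu}(G_\pi) = \mathop{\mathbf{E}}_{P_{S_j}\sim\nu} R_{P_{S_j}}(G_\pi)$ (instead of $R_{P_S}(G_\rho)$) and Theorem 18,
with $\delta$ chosen respectively as $\frac{\delta}{3}$ and $\frac{2\delta}{3}$. □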


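G Proof of Theorem 17

Proof. Consider the data distribution $\mathcal{P} \overset{\mathrm{def}}{=} P_{S_1} \times P_{S_2} \times \ldots \times P_{S_n}$. The loss of a classifier $h \in \mathcal{H}$ on a
tuple of examples $((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{P}$ is defined as the mean of the zero-one loss $\mathcal{L}_{0\text{-}1}(h(x_j), y_j)$ on
each example of the tuple (i.e., $j \in \{1, \ldots, n\}$).

Thanks to this convention, and by a slight abuse of notation, we can write the expected risk on $\mathcal{P}$ of a
classifier $h \in \mathcal{H}$ as

\[
R_{\mathcal{P}}(h) \;=\; \mathop{\mathbf{E}}_{((x_1,y_1),\ldots,(x_n,y_n))\sim\mathcal{P}}\ \frac{1}{n} \sum_{j=1}^{n} \mathcal{L}_{0\text{-}1}(h(x_j), y_j),
\]

and the expected disagreement of a pair of classifiers $(h, h') \in \mathcal{H}^2$ on the corresponding marginal distribution
$\mathcal{D} \overset{\mathrm{def}}{=} D_{S_1} \times D_{S_2} \times \ldots \times D_{S_n}$ as

\[
R_{\mathcal{D}}(h, h') \;=\; \mathop{\mathbf{E}}_{(x_1,\ldots,x_n)\sim\mathcal{D}}\ \frac{1}{n} \sum_{j=1}^{n} \mathcal{L}_{0\text{-}1}(h(x_j), h'(x_j)) .
\]

Let us now define a new posterior $\rho_\nu$ and a new prior $\pi_\mu$ on $\mathcal{H}$:

\[
\rho_\nu(h) \;\overset{\mathrm{def}}{=}\; \rho(h) \sum_{j=1}^{n} \nu(P_{S_j})
\qquad\text{and}\qquad
\pi_\mu(h) \;\overset{\mathrm{def}}{=}\; \pi(h) \sum_{j=1}^{n} \mu(P_{S_j}) .
\]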

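From the above definitions, one can easily show

\[
R_{P_S^\nu}(G_\rho) = R_{\mathcal{P}}(G_{\rho_\nu}), \qquad\text{and}\qquad \mathrm{dis}_\rho(D_S^\nu, D_T) = \mathrm{dis}_{\rho_\nu}(\mathcal{D}, D_T) .
\]

Moreover, we have

\[
\begin{aligned}
\mathrm{KL}(\rho_\nu\|\pi_\mu) &= \mathop{\mathbf{E}}_{h\sim\rho_\nu} \ln \frac{\rho_\nu(h)}{\pi_\mu(h)} \\
&= \mathop{\mathbf{E}}_{h\sim\rho}\ \sum_{j=1}^{n} \nu(P_{S_j}) \ln \frac{\rho(h)\,\nu(P_{S_j})}{\pi(h)\,\mu(P_{S_j})} \\
&= \mathop{\mathbf{E}}_{h\sim\rho} \ln \frac{\rho(h)}{\pi(h)} + \sum_{j=1}^{n} \nu(P_{S_j}) \ln \frac{\nu(P_{S_j})}{\mu(P_{S_j})} \\
&= \mathrm{KL}(\rho\|\pi) + \mathrm{KL}(\nu\|\mu).
\end{aligned}
\]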
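The additive decomposition of the KL divergence can be checked numerically. The sketch below makes an explicit assumption about how to read $\rho_\nu$ and $\pi_\mu$: it treats them as product distributions over pairs (hypothesis, source index), with mass $\rho(h)\,\nu(P_{S_j})$ and $\pi(h)\,\mu(P_{S_j})$ respectively. All distributions are hypothetical toy values.

```python
import math

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical posterior/prior over two hypotheses and three source domains.
rho = [0.6, 0.4]
prior = [0.5, 0.5]
nu = [0.2, 0.3, 0.5]
mu = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

# Product distributions over (hypothesis, source index) pairs (our reading).
joint_posterior = [r * v for r in rho for v in nu]
joint_prior = [p * u for p in prior for u in mu]

kl_joint = kl_div(joint_posterior, joint_prior)
kl_sum = kl_div(rho, prior) + kl_div(nu, mu)
```

Under this reading the joint divergence splits exactly into the two terms that appear in Theorem 17, one penalizing the choice of $\rho$ and one penalizing the choice of source weights $\nu$.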


From Theorem 11, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (\mathcal{P} \times D_T)^m$, for every
posterior distribution $\rho_\nu$ on $\mathcal{H}$, we have

\[
R_{P_T}(G_{\rho_\nu}) \;\le\; c'\,R_S(G_{\rho_\nu}) + a'\,\frac{1}{2}\,\mathrm{dis}_{\rho_\nu}(S,T) + \left( \frac{c'}{c} + \frac{a'}{a} \right) \frac{\mathrm{KL}(\rho_\nu\|\pi_\mu) + \ln\frac{3}{\delta}}{m} + \lambda_{\rho_\nu} + \frac{1}{2}(a' - 1),
\]

and we obtain the final result by substituting $R_S(G_{\rho_\nu})$, $\mathrm{dis}_{\rho_\nu}(S,T)$, and $\mathrm{KL}(\rho_\nu\|\pi_\mu)$ with their
equivalent expressions. □